Preface 



I have just gone through my email archive. The first exchange of messages with M. J. 
Cardoso, M. D., dates from the end of 2001. It was Pedro Cardoso, her brother and my 
superior at INESC Porto, where I had been a researcher and developer since 1999, who had put 
us in contact. She had just started her PhD and would probably need some assistance from 
someone skilled in software development and mathematics. Although I was already enrolled 
to start my own PhD at the beginning of 2002, I accepted. 

Simultaneously, since the end of 2000, I had been working at INESC Porto on the MetaVision 
project. The MetaVision project proposed an innovative electronic production system to 
reduce the cost of film production and to allow more artistic flexibility in shooting and film 
editing. It also provided the enabling technology for the integration of real and virtual 
images at source quality for film production and in TV studios in the compressed domain. 

2004 brought with it the end of the MetaVision project. That left some free time, which I 
used to fill some gaps in my mathematical background by enrolling in a master's programme in 
engineering mathematics. This programme offers solid training in diverse areas of applied 
mathematics, divided into four main areas, including the analysis and processing 
of information. 

Coincidentally, 2004 would also be the year of close collaboration with M. J. Cardoso. Her 
aim was to develop an objective measure for the overall cosmetic result of breast cancer 
conservative treatment. When confronted with the problem, a machine learning approach 
(a topic that I was delving into in the master's classes) emerged as the right move. A suggestion 
was made to predict the overall cosmetic result from a few simple measures taken from the 
patient. I already knew some tools to tackle the problem, but only superficially. That led me 
to select automatic classification and pattern recognition, lectured by Professor Joaquim 
F. Pinto da Costa, as one of the modules to attend. 

The application of some state-of-the-art methods for ordinal data to the problem 
at hand sparked my interest in this specific topic of classification. What I learned, and the 
breakthroughs that were accomplished, are what I would like to share with you. 

This work became possible due to the support of Professor Joaquim F. Pinto da Costa, my 
supervisor. I also discussed ideas presented in this thesis with L. Gustavo Martins, Luis F. 
Teixeira and M. Carmo Sousa. They also read the manuscript, providing important feedback. 
I would like to express my deep gratitude to all of them. 

Jaime dos Santos Cardoso 
September 2005 



Abstract 



Predictive learning has traditionally been treated as standard inductive learning, where different 
subproblem formulations have been identified. One of the most representative is classification, 
which consists of estimating a mapping from the feature space into a finite class space. 
Depending on the cardinality of the finite class space we are left with binary or multiclass 
classification problems. Finally, the presence or absence of a "natural" order among classes 
separates nominal from ordinal problems. 

Although two-class and nominal classification problems have been dissected in the literature, 
the ordinal sibling has not yet received a lot of attention, even with many learning problems 
involving classifying examples into classes which have a natural order. Scenarios in which it 
is natural to rank instances occur in many fields, such as information retrieval, collaborative 
filtering, econometric modeling and natural sciences. 

Conventional methods for nominal classes or for regression problems could be employed to 
solve ordinal data problems; however, the use of techniques designed specifically for ordered 
classes yields simpler classifiers, making it easier to interpret the factors that are being used 
to discriminate among classes, and generalises better. Although the ordinal formulation 
seems conceptually simpler than nominal, some technical difficulties to incorporate in the 
algorithms this piece of additional information - the order - may explain the widespread use 
of conventional methods to tackle the ordinal data problem. 

This dissertation addresses this void by proposing a nonparametric procedure for the classifi- 
cation of ordinal data based on the extension of the original dataset with additional variables, 
reducing the classification task to the well-known two-class problem. This framework unifies 
two well-known approaches for the classification of ordinal categorical data, the minimum 
margin principle and the generic approach by Frank and Hall. It also presents a probabilistic 
interpretation for the neural network model. A second novel model, the unimodal model, 
is also introduced and a parametric version is mapped into neural networks. Several case 
studies are presented to assess the validity of the proposed models. 

Keywords: machine learning, classification, ordinal data, neural networks, support vector 
machines 



Resumo 



Traditionally, predictive machine learning has been treated as standard inductive learning, 
within which different subproblems have been formulated. One of the most representative 
is classification, which consists of estimating a function from the attribute space 
into a finite space of classes. Depending on the cardinality of the class space 
we have a binary or a multiclass classification problem. Finally, the existence or 
absence of a "natural" order among the classes distinguishes nominal multiclass problems 
from ordinal multiclass problems. 

Although binary and nominal multiclass classification problems have been dissected 
in the literature, the sibling problem of ordinal data has gone unnoticed, even though 
many machine learning problems involve classifying data that 
possess a natural order. Scenarios in which it is natural to rank examples occur in the 
most diverse areas, such as information search or retrieval, collaborative filtering, 
economic modelling and natural sciences. 

Conventional methods for nominal classes or for regression problems can be 
employed to solve the ordinal problem; however, the use of techniques developed 
specifically for ordered classes produces simpler classifiers, making it easier to 
interpret the factors that play an important role in discriminating 
the classes, and generalises better. Although the ordinal formulation appears to be conceptually 
simpler than the nominal one, some technical difficulties in incorporating 
this piece of additional information - the order - into the algorithms may explain the widespread use of 
conventional methods to attack the ordinal data problem. 

This dissertation addresses this void by proposing a nonparametric method for the classification 
of ordinal data based on the extension of the original dataset with additional variables, 
reducing the classification problem to the familiar binary classification problem. The 
proposed methodology unifies two well-established approaches for the classification of 
ordinal data, the minimum margin principle and the generic method of Frank and Hall. A 
probabilistic interpretation for the model mapped into neural networks is also 
presented. A second model, the unimodal model, is also introduced and a 
parametric version is mapped into neural networks. Several case studies are presented to 
demonstrate the validity of the proposed models. 

Keywords: machine learning, classification, ordinal data, neural networks, 
support vector machines 
Chapter 1 



Introduction 



1.1 Problem formulation 



Predictive learning has traditionally been treated as standard inductive learning, with two modes of 
inference: system identification (with the goal of density estimation) and system imitation 
(for generalization). Nonetheless, predictive learning does not end with inductive learning. 
While with inductive learning the main assumptions are a finite training set and a large 
(infinite), unknown test set, other problem settings may be devised. 

The transduction formulation [1] assumes a given set of labeled, training data and a finite, 
known set of unlabeled test points, with the interest to estimate the class labels only at these 
points. The selection type of inference is, in some sense, even simpler than transduction: 
given a set of labeled training data and unlabeled test points, select a subset of test points 
with the highest probability of belonging to one class. Selective inference needs only to select 
a subset of m test points, rather than assign class labels to all test points. A hierarchy of 
types of inference can be, not exhaustively, listed [2]: identification, imitation, transduction, 
selection, etc. 

Under the traditional inductive learning, different (sub-)problem formulations have been 
identified. Two of the most representative are regression and classification. While both 
consist of estimating a mapping from the feature space, regression looks for a real-valued 
function defined in the feature space, whereas classification maps the feature space 
into a finite class space. Depending on the cardinality of the finite class space we are left 
with two-class or multiclass classification problems. Finally, the presence or absence of a 
"natural" order among classes separates nominal from ordinal problems. 
1.2 Motivation 

Although two-class and nominal data classification problems have been dissected in the 
literature, the ordinal sibling has not yet received a lot of attention, even with many learning 
problems involving classifying examples into classes which have a "natural" order. Settings 
in which it is natural to rank instances arise in many fields, such as information retrieval [3], 
collaborative filtering [4], econometric modeling [5] and natural sciences [6].* 

Conventional methods for nominal classes or for regression problems could be employed to 
solve ordinal data problems ([8-10]); however, the use of techniques designed specifically for 
ordered classes results in simpler classifiers, making it easier to interpret the factors that 
are being used to discriminate among classes [5]. Although the ordinal formulation seems 
conceptually simpler than nominal, some technical difficulties to incorporate the piece of 
additional information - the order - in the algorithms may explain the widespread use of 
conventional methods to tackle the ordinal data problem. 



1.3 Tools 

As seen, there are relatively few predictive learning formulations; however, the number of 
learning algorithms, especially for the inductive case, is overwhelming. Many frameworks, 
adaptations to real-life problems, intertwining of base algorithms were, and continue to be, 
proposed in the literature; ranging from statistical approaches to state of the art machine 
learning algorithms, parametric to non parametric procedures, a plethora of methods is 
available to users. 

Our study will not attempt to cover them all. Limited by time (and competence to add 
significant contributions), two major algorithms will be the "horsepower" of our work: 
support vector machines and neural networks. Other base approaches, such as decision trees, 
for which interesting algorithms for ordinal data have already been proposed ([11-13]), will 
have to wait for another opportunity. 

*It is worth pointing out that distinct tasks of relation learning, where an example is no longer associated 
with a class or rank, which include preference learning and reranking [7], are topics of research on their own. 



1.3.1 The ABC of support vector machines 

Consider briefly how the SVM binary classification problem is formulated [1].* 

For two training classes linearly separable in the selected feature space, the distinctive idea of 
SVM is to define a linear discriminant function $g(x) = w^t x + b$ in the feature space bisecting 
the two training classes and characterized by $g(x) = 0$. However, there may be infinitely 
many such surfaces. To select the surface best suited to the task, the SVM maximizes 
the distance between the decision surface and those training points lying closest to it (the 
support vectors). Considering the training set $\{x_i^{(k)}\}$, where $k = 1, 2$ denotes the class 
number and $i = 1, \dots, \ell_k$ is the index within each class, it is easy to show [1] that maximizing 
this distance is equivalent to solving 

\[
\begin{aligned}
\min_{w,b}\quad & \tfrac{1}{2} w^t w \\
\text{s.t.}\quad & -(w^t x_i^{(1)} + b) \ge +1, \quad i = 1, \dots, \ell_1 \\
& +(w^t x_i^{(2)} + b) \ge +1, \quad i = 1, \dots, \ell_2
\end{aligned}
\tag{1.1}
\]



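If the training classes are not linearly separable in feature space, the inequalities in (1.1) 
can be relaxed using slack variables and the cost function modified to penalise any failure to 
meet the original (strict) inequalities. The problem becomes 

\[
\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac{1}{2} w^t w + C \sum_{k=1}^{2} \sum_{i=1}^{\ell_k} \mathrm{sgn}(\xi_i^{(k)}) \\
\text{s.t.}\quad & -(w^t x_i^{(1)} + b) \ge +1 - \xi_i^{(1)}, \quad i = 1, \dots, \ell_1 \\
& +(w^t x_i^{(2)} + b) \ge +1 - \xi_i^{(2)}, \quad i = 1, \dots, \ell_2 \\
& \xi_i^{(k)} \ge 0
\end{aligned}
\tag{1.2}
\]
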

The constraint parameter $C$ controls the tradeoff between the dual objectives of maximizing 
the margin of separation and minimizing the misclassification error. For an error to occur, 
the corresponding $\xi_i^{(k)}$ must exceed unity, so $\sum_{k=1}^{2} \sum_{i=1}^{\ell_k} \mathrm{sgn}(\xi_i^{(k)})$ is an upper bound on the 
number of training errors, that is, on $\sum l_{0-1}(f(x_i^{(k)}), k)$, where $f(x_i^{(k)})$ is the classification 
rule induced by the hyperplane $w^t x + b$. Hence the added penalty component is a natural 
way to assign an extra cost for errors. 

However, optimization of the above is difficult since it involves a discontinuous function 
sgn(). As is common in such cases, we choose to optimize a closely related cost function, 
and the goal becomes to 

\[
\min_{w,b,\xi}\quad \tfrac{1}{2} w^t w + C \sum_{k=1}^{2} \sum_{i=1}^{\ell_k} \xi_i^{(k)}
\tag{1.3}
\]

under the same set of constraints as (1.2). 

*The following introduction to SVMs is based largely on [14]. 
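As a rough illustration of the relaxed problem, the sketch below evaluates the slack-based cost for a given candidate hyperplane, taking each slack at its smallest feasible value under the constraints of (1.2). It only scores a hyperplane; it does not solve the optimization. The function name `soft_margin_objective` and the toy points are illustrative assumptions, not part of the original formulation.

```python
def soft_margin_objective(w, b, class1, class2, C=1.0):
    """Return (objective, slacks) for the candidate hyperplane w.x + b = 0.

    Class 1 points should satisfy -(w.x + b) >= 1 and class 2 points
    +(w.x + b) >= 1. Each slack is the smallest xi >= 0 that makes the
    corresponding relaxed constraint hold.
    """
    def g(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b

    slacks = [max(0.0, 1.0 + g(x)) for x in class1]   # -(g) >= 1 - xi
    slacks += [max(0.0, 1.0 - g(x)) for x in class2]  # +(g) >= 1 - xi
    margin_term = 0.5 * sum(wi * wi for wi in w)
    return margin_term + C * sum(slacks), slacks


# Toy data (assumed): the third class 1 point lies on the wrong side.
class1 = [(-2.0, 0.0), (-1.0, 0.0), (0.5, 0.0)]
class2 = [(1.0, 0.0), (2.0, 0.0)]
value, slacks = soft_margin_objective((1.0, 0.0), 0.0, class1, class2, C=10.0)
```

In this toy case only the misplaced point has a nonzero slack, and that slack exceeds one, in line with the remark that an error requires the slack to exceed unity.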

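In order to account for different misclassification costs or sampling bias, the model can 
be extended to penalise the slack variables according to different weights in the objective 
function [15]: 

\[
\min_{w,b,\xi}\quad \tfrac{1}{2} w^t w + \sum_{k=1}^{2} \sum_{i=1}^{\ell_k} C_i^{(k)} \xi_i^{(k)}
\tag{1.4}
\]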



1.3.2 The ABC of neural networks 

Neural networks were originally developed from attempts to model the communication and 
processing of information in the human brain. Analogous to the brain, a neural network 
consists of a number of inputs (variables), each of which is multiplied by a weight, which 
is analogous to a dendrite. The products are summed and transformed in a "neuron" (i.e. 
simple processing unit) and the result becomes an input value for another neuron [16]. 

A multilayer feedforward neural network consists of an input layer of signals, an output 
layer of output signals, and a number of layers of neurons in between, called hidden layers 
[17-19]. It was shown that, under mild conditions, these models can approximate any decision 
function and its derivatives to any degree of accuracy. 

To use a neural network for classification, we need to construct an equivalent function 
approximation problem by assigning a target value for each class. For a two-class problem 
we can use a network with a single output, and binary target values: 1 for one class, and 0 for 
the other. We can thus interpret the network's output as an estimate of the probability that 
a given pattern belongs to the '1' class. The training of the network is commonly performed 
using the popular mean square error. 

For multiclass classification problems (1-of-K, where K > 2) we use a network with K 
outputs, one corresponding to each class, and target values of 1 for the correct class, and 
0 otherwise. Since these targets are not independent of each other, however, it is no longer 
appropriate to use the same error measure. The correct generalization is through a special 
activation function (the softmax) designed so as to satisfy the normalization constraint on 
the total probability [10]. 
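A minimal sketch of the two ingredients just described, the 1-of-K target coding and the softmax activation, is given below. The function names are illustrative assumptions; the max-subtraction inside `softmax` is a common numerical safeguard and not something the text prescribes.

```python
import math


def one_of_k(label, K):
    """Target vector with 1 for the correct class (labels 1..K) and 0 otherwise."""
    return [1.0 if k == label else 0.0 for k in range(1, K + 1)]


def softmax(activations):
    """Map K raw output activations to K probabilities that sum to one."""
    shift = max(activations)  # subtracting the max avoids overflow in exp
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


targets = one_of_k(2, 3)          # target coding for class 2 of 3
probs = softmax([1.0, 2.0, 3.0])  # normalized outputs for three classes
```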







However, this approach does not retain the ordinality or rank order of the classes and is not, 
therefore, appropriate for ordinal multiclass classification problems. A clear exception is 
the PRank algorithm by Crammer [20], and its improvement by Harrington [21], which is a 
variant of the perceptron algorithm. As we progress in this work, several other approaches 
will be presented, making use of generic neural networks. 

1.4 Thesis' structure 

This thesis introduces in chapter 2 the data replication method, a nonparametric procedure 
for the classification of ordinal data based on the extension of the original dataset with 
additional variables, reducing the classification task to the well known two-class problem. 
Starting with the simpler linear case, the chapter evolves to the nonlinear case; from there 
the method is extended to incorporate the procedure of Frank and Hall [22]. Finally, the 
generic version of the data replication method is presented, allowing partial constraints on 
variables. 

In chapter 3 the data replication method is mapped into two important machine learning 
algorithms: support vector machines and neural networks. A comparison is made with a 
previous SVM approach introduced by Shashua [4], the minimum margin principle, showing 
that the data replication method leads essentially to the same solution, but with some key 
advantages. The chapter is elegantly concluded with a reinterpretation of the neural network 
model as a generalization of the ordinal logistic regression model. 

The second novel model, the unimodal model, is introduced in a subsequent chapter, and a 
parametric version is mapped into neural networks. A parallelism of this approach with regression 
models concludes that chapter. 

A further chapter introduces the experimental methodology and the algorithms that were compared 
in the experiments reported in the succeeding chapters. Finally, results are 
discussed, conclusions are drawn and future work is outlined in the closing chapter. 

1.5 Contributions 

We summarize below the contributions of this thesis towards more efficient and parsimonious 
methods for classification of ordinal data. In this thesis we have 

1. introduced to the machine learning community the data replication method, a nonparametric 
procedure for the classification of ordinal categorical data, and presented the 
mapping of this method to neural networks and support vector machines; 

2. unified under this framework two well-known approaches for the classification of ordinal 
categorical data, the minimum margin principle [4] and the generic approach by Frank 
and Hall [22], and presented a probabilistic interpretation for the neural network 
model; 

3. introduced the unimodal model, mapped to neural networks, a second approach for 
the classification of ordinal data, and established links to previous works. 

Publications related to the thesis 

[23] J. S. Cardoso, J. F. P. da Costa, and M. J. Cardoso, "SVMs applied to objective aesthetic 
evaluation of conservative breast cancer treatment," in Proceedings of International Joint 
Conference on Neural Networks (IJCNN) 2005, 2005, pp. 2481-2486. 

[6] J. S. Cardoso, J. F. P. da Costa, and M. J. Cardoso, "Modelling ordinal relations 
with SVMs: an application to objective aesthetic evaluation of breast cancer conservative 
treatment," (ELSEVIER) Neural Networks, vol. 18, pp. 808-817, June-July 2005. 

[24] J. F. P. da Costa and J. S. Cardoso, "Classification of ordinal data using neural 
networks," in Proceedings of European Conference Machine Learning (ECML) 2005, 2005, 
pp. 690-697. 

[25] J. S. Cardoso and J. F. P. da Costa, "Learning to classify ordinal data: the data 
replication method," (submitted) Journal of Machine Learning Research. 



Chapter 2 

The data replication method 



Let us formulate the problem of separating $K$ ordered classes $C_1, \dots, C_K$. Consider the 
training set $\{x_i^{(k)}\}$, where $k = 1, \dots, K$ denotes the class number, $i = 1, \dots, \ell_k$ is the index 
within each class, and $x_i^{(k)} \in \mathbb{R}^p$, with $p$ the dimension of the feature space. Let $\ell = \sum_{k=1}^{K} \ell_k$ 
be the total number of training examples. 

Suppose that a $K$-class classifier was forced, by design, to have $K - 1$ noncrossing boundaries, 
with boundary $i$ discriminating classes $C_1, \dots, C_i$ against classes $C_{i+1}, \dots, C_K$. As the 
intersection point of two boundaries would indicate an example with three or more classes 
equally probable - not plausible with ordinal classes -, this strategy imposes an (arguably) 
intuitive restriction. With this constraint emerges a monotonic model, where a better value 
in an attribute does not lead to a lower decision class. For the linear case, this translates 
to choosing the same weighted sum for all decisions - the classifier would be just a set of 
weights, one for each feature, and a set of biases, the scale in the weighted sum. By avoiding 
the intersection of any two boundaries, this simplified model captures better the essence 
of the ordinal data problem. Another strength of this approach is the reduced number of 
parameters to estimate, which may lead to a more robust classifier, with greater capacity for 
generalization. 

This rationale leads to a straightforward generalization of the two-class separating 
hyperplane [4]. Define $K - 1$ separating hyperplanes that separate the training data into $K$ 
ordered classes by modeling the ranks as intervals on the real line - an idea with roots in 
the classical cumulative model [3,26]. The geometric interpretation of this approach is to 
look for $K - 1$ parallel hyperplanes represented by vector $w \in \mathbb{R}^p$ and scalars $b_1, \dots, b_{K-1}$, 
such that the feature space is divided into equally ranked regions by the decision boundaries 
$w^t x + b_r$, $r = 1, \dots, K - 1$. 



*Some portions of this chapter appeared in [6]. 

It would be interesting to accommodate this formulation under the two-class problem. That 
would allow the use of mature and optimized algorithms, developed for the two-class problem. 
The data replication method allows us to do precisely that. 



2.1 Data replication method — the linear case 



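To outline the rationale behind the proposed model for the linear case, consider first a 
hypothetical, simplified scenario with three classes in $\mathbb{R}^2$. The plot of the dataset is presented 
in figure 2.1(a). 

Using a transformation from the initial $\mathbb{R}^2$ feature space to an $\mathbb{R}^3$ feature space, replicate each 
original point according to the rule (figure 2.1(b)): 

\[
x \in \mathbb{R}^2 \;\mapsto\; \begin{bmatrix} x \\ 0 \end{bmatrix}, \begin{bmatrix} x \\ h \end{bmatrix} \in \mathbb{R}^3,
\quad \text{where } h = \text{const} \in \mathbb{R}^+
\]

Observe that any two points created from the same starting point differ only in the 
new variable. 

Define now a binary training set, with binary classes $\overline{C}_1$ and $\overline{C}_2$, in the high-dimensional 
space according to (figure 2.1(c)): 

\[
\begin{aligned}
\begin{bmatrix} x_i^{(1)} \\ 0 \end{bmatrix} \in \overline{C}_1, \quad
\begin{bmatrix} x_i^{(2)} \\ 0 \end{bmatrix}, \begin{bmatrix} x_i^{(3)} \\ 0 \end{bmatrix} \in \overline{C}_2 \\
\begin{bmatrix} x_i^{(1)} \\ h \end{bmatrix}, \begin{bmatrix} x_i^{(2)} \\ h \end{bmatrix} \in \overline{C}_1, \quad
\begin{bmatrix} x_i^{(3)} \\ h \end{bmatrix} \in \overline{C}_2
\end{aligned}
\tag{2.1}
\]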



A linear two-class classifier can now be applied to the extended dataset, yielding a hyperplane 
separating the two classes (figure 2.1(d)). The intersection of this hyperplane with each of 
the subspace replicas (by setting $x_3 = 0$ and $x_3 = h$ in the equation of the hyperplane) can 
be used to derive the boundaries in the original dataset (figure 2.1(e)). 



Although the foregoing analysis enables us to classify unseen examples in the original dataset, 
classification can be done directly in the extended dataset, using the binary classifier, without 
explicitly resorting to the original dataset. For a given example $\in \mathbb{R}^2$, classify each of its two 
replicas $\in \mathbb{R}^3$, obtaining a sequence of two binary labels $\in \{\overline{C}_1, \overline{C}_2\}^2$. From this sequence infer the 
class according to the rule 

\[
\overline{C}_1 \overline{C}_1 \Rightarrow C_1, \qquad
\overline{C}_2 \overline{C}_1 \Rightarrow C_2, \qquad
\overline{C}_2 \overline{C}_2 \Rightarrow C_3
\]



With the material on how to construct a set of optimal hyperplanes for the toy example, 
we are now in a position to formally describe the construction of a $K$-class classifier for 
ordinal classification. Define $e_0$ as the sequence of $K - 2$ zeros and $e_q$ as the sequence of 
$K - 2$ symbols $0, \dots, 0, h, 0, \dots, 0$, with $h$ in the $q$-th position. Considering the problem 
of separating $K$ classes $C_1, \dots, C_K$ with training set $\{x_i^{(k)}\}$, define a new high-dimensional 
binary training dataset, with binary classes $\overline{C}_1$ and $\overline{C}_2$, as 

\[
\begin{aligned}
\begin{bmatrix} x_i^{(k)} \\ e_0 \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = 1 \\ \overline{C}_2 & k = 2, \dots, \min(K, 1 + s) \end{cases} \\
&\;\;\vdots \\
\begin{bmatrix} x_i^{(k)} \\ e_{q-1} \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = \max(1, q - s + 1), \dots, q \\ \overline{C}_2 & k = q + 1, \dots, \min(K, q + s) \end{cases} \\
&\;\;\vdots \\
\begin{bmatrix} x_i^{(k)} \\ e_{K-2} \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = \max(1, K - 1 - s + 1), \dots, K - 1 \\ \overline{C}_2 & k = K \end{cases}
\end{aligned}
\tag{2.2}
\]

where the role of parameter $s \in \{1, \dots, K - 1\}$ is to bound the number of classes, to the 
'left' and to the 'right', involved in the constraints of a boundary. This allows control of 
the increase of data points inherent to this method. The toy example in figure 2.1(b) was 
illustrated with $s = K - 1 = 2$; setting $s = 1$ would result in the dataset illustrated in figure 2.2, 
with essentially the same solution. 



Then construct a linear two-class classifier on this extended dataset; to classify an unseen 
example obtain a sequence of $(K - 1)$ binary labels $\in \{\overline{C}_1, \overline{C}_2\}^{(K-1)}$ by classifying each of the $(K - 1)$ 
replicas in the extended dataset with the binary classifier. Note that, because the $(K - 1)$ 
boundaries do not cross each other, there are only $K$ different possible sequences. The target 
class can be obtained by adding one to the number of $\overline{C}_2$ labels in the sequence. 
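The replication and decoding steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `replicate` and `decode` are hypothetical, binary classes are coded as the integers 1 and 2, and the binary classifier itself is left out, so any two-class learner could be trained on the points this produces.

```python
def replicate(x, k, K, h=1.0, s=None):
    """Binary training points generated by example x of ordinal class k (1..K).

    Returns a list of (extended_point, binary_label) pairs, binary_label in {1, 2}.
    Replica q (q = 1..K-1) appends e_{q-1}: K-2 extra variables, all zero
    except h in position q-1 (e_0 is all zeros).
    """
    if s is None:
        s = K - 1  # every class takes part in every boundary
    out = []
    for q in range(1, K):
        extra = [0.0] * (K - 2)
        if q >= 2:
            extra[q - 2] = h
        if max(1, q - s + 1) <= k <= q:
            out.append((list(x) + extra, 1))
        elif q + 1 <= k <= min(K, q + s):
            out.append((list(x) + extra, 2))
        # otherwise class k is not involved in the constraints of boundary q
    return out


def decode(binary_labels):
    """Ordinal class = one plus the number of replicas assigned to binary class 2."""
    return 1 + sum(1 for label in binary_labels if label == 2)


# Toy setting with K = 3: a class 2 point falls in binary class 2 in the
# first replica and in binary class 1 in the second.
points = replicate((0.5, 1.5), k=2, K=3, h=1.0)
```

With `s` smaller than `K - 1`, some examples contribute to fewer replicas, which is how the growth of the extended dataset is kept in check.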



2.2 Data replication method — the nonlinear case 

So far we have assumed linear boundaries between classes. There are important situations 
in which such a restriction does not exist, but the order of the classes is kept. Inspired by 
the data replication method just presented, we can look for boundaries that are level curves 
of some nonlinear function $G(x)$ defined in the feature space. For the linear version we take 
$G(x) = w^t x$. 

Extending the feature space and modifying to a binary problem, as dictated by the data 
replication method, we can search for a partially linear (nonlinear in the original variables 
but linear in the introduced variables) boundary $\overline{G}(\overline{x}) = G(x) + w^t e_i = 0$, with $w \in \mathbb{R}^{K-2}$ 
and $\overline{x} = [x; e_i]$. The intersection of the constructed high-dimensional boundary with each of 
the subspace replicas provides the desired $(K - 1)$ boundaries. This approach is plotted in 
figure 2.3 for the toy example.* 

*Although a partially linear function $\overline{G}(\overline{x})$ is the simplest to provide noncrossing boundaries in the original 
space (level curves of some function $G(x)$), it is by no means the only type of function to provide them. 
2.3 A general framework 



As presented so far, the data replication method only allows searching for boundaries that are 
parallel hyperplanes (level curves in the nonlinear case). That is, a single direction is specified for all 
boundaries. In the quest for an extension allowing more loosely coupled boundaries, let us 
start by reviewing a method for ordinal data already presented in the literature. 



2.3.1 The method of Frank and Hall 



Frank and Hall [22] introduced a simple algorithm that enables standard classification 
algorithms to exploit the ordering information in ordinal prediction problems. First, the 
data is transformed from a $K$-class ordinal problem to $K - 1$ binary class problems. Training 
of the $i$-th classifier is performed by converting the ordinal dataset with classes $C_1, \dots, C_K$ 
into a binary dataset, discriminating $C_1, \dots, C_i$ against $C_{i+1}, \dots, C_K$; in fact it represents the 
test $C_x > i$. To predict the class value of an unseen instance, the $K - 1$ binary outputs are 
combined to produce a single estimation. Any binary classifier can be used as the building 
block of this scheme. 

Observe that, under our approach, the $i$-th boundary is also discriminating $C_1, \dots, C_i$ against 
$C_{i+1}, \dots, C_K$; the major difference lies in the independence of the boundaries found with 
Frank and Hall's method. 
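The transformation into $K - 1$ binary problems can be sketched as below. The text above only states that the binary outputs are combined; the difference-of-probabilities rule in `combine` is the scheme usually associated with Frank and Hall's method and should be read as an assumption here, as should all function names.

```python
def binary_targets(labels, K):
    """For i = 1..K-1, build the 0/1 targets of the test 'class > i'."""
    return [[1 if y > i else 0 for y in labels] for i in range(1, K)]


def combine(prob_greater):
    """Turn estimates of P(class > i), i = 1..K-1, into K class probabilities."""
    K = len(prob_greater) + 1
    probs = [1.0 - prob_greater[0]]                          # P(class = 1)
    for i in range(1, K - 1):
        probs.append(prob_greater[i - 1] - prob_greater[i])  # P(class = i + 1)
    probs.append(prob_greater[-1])                           # P(class = K)
    return probs


def predict(prob_greater):
    """Predicted class (1..K) is the one with the largest combined probability."""
    probs = combine(prob_greater)
    return 1 + probs.index(max(probs))


targets = binary_targets([1, 2, 3, 2], K=3)
label = predict([0.9, 0.2])
```

Because the $K - 1$ classifiers are trained independently, nothing forces the estimates of P(class > i) to decrease with i, so some of the differences computed in `combine` can be negative. This is one visible consequence of the independence of the boundaries noted above.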



2.3.2 A parameterized family of classifiers 



Up to now, when replicating the original dataset, the original $p$ variables were the first $p$ 
variables of the $p + K - 2$ variables of the new dataset, for each subspace replica, as seen in 
(2.2). 

Returning to the toy example, assume that the replication was done not according to (2.1) 
but instead using the following rule: 

\[
\begin{aligned}
\begin{bmatrix} x_i^{(1)} \\ 0_2 \\ 0 \end{bmatrix} \in \overline{C}_1, \quad
\begin{bmatrix} x_i^{(2)} \\ 0_2 \\ 0 \end{bmatrix}, \begin{bmatrix} x_i^{(3)} \\ 0_2 \\ 0 \end{bmatrix} \in \overline{C}_2 \\
\begin{bmatrix} 0_2 \\ x_i^{(1)} \\ h \end{bmatrix}, \begin{bmatrix} 0_2 \\ x_i^{(2)} \\ h \end{bmatrix} \in \overline{C}_1, \quad
\begin{bmatrix} 0_2 \\ x_i^{(3)} \\ h \end{bmatrix} \in \overline{C}_2
\end{aligned}
\tag{2.3}
\]

where $0_2$ is the sequence of 2 zeros and $\overline{C}_1$, $\overline{C}_2$ are the two binary classes. Intuitively, by 
misaligning variables involved in the determination of different boundaries (variables in 
different subspaces), we are decoupling those same boundaries. 

Proceeding this way, boundaries can be designed almost independently (more on this later, 
when mapping to SVMs). In the linear case we have now four parameters to estimate, the 
same as for two independent lines in $\mathbb{R}^2$. Intuitively, this new rule to replicate the data 
allows the estimation of the direction of each boundary essentially independently. 

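The general formulation in (2.2) becomes 

\[
\begin{aligned}
\begin{bmatrix} x_i^{(k)} \\ 0_{p(K-2)} \\ e_0 \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = 1 \\ \overline{C}_2 & k = 2, \dots, \min(K, 1 + s) \end{cases} \\
&\;\;\vdots \\
\begin{bmatrix} 0_{p(q-1)} \\ x_i^{(k)} \\ 0_{p(K-q-1)} \\ e_{q-1} \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = \max(1, q - s + 1), \dots, q \\ \overline{C}_2 & k = q + 1, \dots, \min(K, q + s) \end{cases} \\
&\;\;\vdots \\
\begin{bmatrix} 0_{p(K-2)} \\ x_i^{(k)} \\ e_{K-2} \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = \max(1, K - 1 - s + 1), \dots, K - 1 \\ \overline{C}_2 & k = K \end{cases}
\end{aligned}
\tag{2.4}
\]

where $0_l$ is the sequence of $l$ zeros, $l \in \mathbb{N}$. 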
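The layout of one decoupled replica can be sketched as below: the original variables are copied into their own block of the extended vector, so that each boundary gets its own set of weights. The function name `replicate_decoupled` is a hypothetical helper, and the binary labels are omitted because they are assigned exactly as in the basic method.

```python
def replicate_decoupled(x, q, K, h=1.0):
    """Extended point for replica q (1..K-1) under the decoupled rule.

    x is copied into block q of K-1 blocks of size p (the other blocks are
    zero), followed by e_{q-1}: K-2 variables, zero except h in position q-1.
    """
    p = len(x)
    blocks = [0.0] * (p * (q - 1)) + list(x) + [0.0] * (p * (K - q - 1))
    extra = [0.0] * (K - 2)
    if q >= 2:
        extra[q - 2] = h
    return blocks + extra


# Toy example (p = 2, K = 3): each replica lives in a 5-dimensional space.
first = replicate_decoupled((3.0, 4.0), q=1, K=3)
second = replicate_decoupled((3.0, 4.0), q=2, K=3)
```

The two replicas share no nonzero coordinate among the first four variables, which is the misalignment that lets a single linear classifier in the extended space encode a different direction for each boundary.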

While the basic linear data replication method requires the estimation of $(p - 1) + (K - 1)$ 
parameters, the new rule requires $(p - 1)(K - 1) + (K - 1)$, the same as the Frank and 
Hall approach; this corresponds to the number of free parameters in $(K - 1)$ independent 
$p$-dimensional hyperplanes. 

While this does not aim at being a practical alternative to Frank's method, it does pave 
the way for intermediate solutions, filling the gap between the totally coupled and totally 
independent boundaries. 

To constrain only the first $j$ variables of the $p$ initial variables to have the same direction in 
all boundaries, while leaving the $(p - j)$ final variables unconstrained, we propose to extend 
the data according to 

\[
\begin{aligned}
\begin{bmatrix} x_i^{(k)}(1:j) \\ x_i^{(k)}(j+1:p) \\ 0_{(p-j)(K-2)} \\ e_0 \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = 1 \\ \overline{C}_2 & k = 2, \dots, \min(K, 1 + s) \end{cases} \\
&\;\;\vdots \\
\begin{bmatrix} x_i^{(k)}(1:j) \\ 0_{(p-j)(q-1)} \\ x_i^{(k)}(j+1:p) \\ 0_{(p-j)(K-q-1)} \\ e_{q-1} \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = \max(1, q - s + 1), \dots, q \\ \overline{C}_2 & k = q + 1, \dots, \min(K, q + s) \end{cases} \\
&\;\;\vdots \\
\begin{bmatrix} x_i^{(k)}(1:j) \\ 0_{(p-j)(K-2)} \\ x_i^{(k)}(j+1:p) \\ e_{K-2} \end{bmatrix} &\in
\begin{cases} \overline{C}_1 & k = \max(1, K - 1 - s + 1), \dots, K - 1 \\ \overline{C}_2 & k = K \end{cases}
\end{aligned}
\tag{2.5}
\]

With this rule $[p - 1 - (j - 1)](K - 1) + (K - 1) + j - 1$, $j \in \{1, \dots, p\}$, parameters are to 
be estimated. 
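As a quick consistency check, the parameter count of the general rule can be evaluated at its two extremes. The helper `n_parameters` is a hypothetical name; it simply evaluates the expression stated above.

```python
def n_parameters(p, K, j):
    """Free parameters when the first j of p variables share a direction across boundaries."""
    return (p - 1 - (j - 1)) * (K - 1) + (K - 1) + (j - 1)


# j = p: fully coupled boundaries, (p - 1) + (K - 1) parameters.
coupled = n_parameters(p=2, K=3, j=2)
# j = 1: fully decoupled boundaries, (p - 1)(K - 1) + (K - 1) parameters.
decoupled = n_parameters(p=2, K=3, j=1)
```

For the toy example ($p = 2$, $K = 3$) the two extremes give three and four parameters respectively, the latter matching the four parameters of two independent lines in the plane.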

This general formulation of the data replication method allows the enforcement of only the 
amount of knowledge (constraints) that is effectively known a priori, building the right 
amount of parsimony into the model. 
(e) Linear solution in the original dataset. 
Figure 2.1: Proposed data extension model in a toy example. 

Figure 2.2: Toy dataset replicated in $\mathbb{R}^3$, $h = 1$, $s = 1$. 




Figure 2.3: Nonlinear data extension model in the toy example. 



Chapter 3 



Mapping the data replication 
method to learning algorithms 



Suppose that examples in a classification problem come from one of K classes, numbered 
from 1 to K, corresponding to their natural order if one exists, and arbitrarily otherwise. 
The learning task is to select a prediction function $f(x)$ from a family of possible functions 
that minimizes the expected loss. 

In the absence of reliable information on relative costs, a natural approach for unordered 
classes is to treat every misclassification as equally costly. This translates to adopting 
the non-metric indicator function $l_{0-1}(f(x), y) = 0$ if $f(x) = y$ and $l_{0-1}(f(x), y) = 1$ if 
$f(x) \neq y$, where $f(x)$ and $y$ are the predicted and true classes, respectively. Measuring the 
performance of a classifier using the $l_{0-1}$ loss function is equivalent to simply considering 
the misclassification error rate. However, for ordered classes, losses that increase with the 
absolute difference between the class numbers are more natural choices in the absence of 
better information [5]. This loss should be naturally incorporated during the training period 
of the learning algorithm. 

A risk functional that takes into account the ordering of the classes can be defined as 

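$$R(f) = E\left[ l^s\left(f(x^{(k)}), k\right) \right] \tag{3.1}$$

with

$$l^s\left(f(x^{(k)}), k\right) = \min\left(|f(x^{(k)}) - k|, s\right)$$

The empirical risk is the average of the number of mistakes, where the magnitude of a mistake
is related to the total ordering:

$$R^s_{emp}(f) = \frac{1}{\ell} \sum_{k=1}^{K} \sum_{i=1}^{\ell_k} l^s\left(f(x_i^{(k)}), k\right)$$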

Arguing as [3], we see that the role of parameter $s$ (bounding the loss incurred in each
example) is to allow for an incorporation of a priori knowledge about the probability of
the classes, conditioned by $x$, $P(C_k|x)$. This can be treated as an assumption on the
concentration of the probability around a "true" rank. Let us see how all this finds its
place with the data replication method.
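As a minimal sketch of the bounded loss $l^s$ and the empirical risk defined above, the following Python fragment evaluates both on a handful of predictions. The function names and the sample values are illustrative assumptions, not part of the original experiments.

```python
def bounded_abs_loss(predicted, true, s):
    """Loss l^s: absolute rank difference, capped at s."""
    return min(abs(predicted - true), s)


def empirical_risk(predictions, targets, s):
    """Average capped loss over all examples (classes are integers 1..K)."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    total = sum(bounded_abs_loss(p, t, s) for p, t in zip(predictions, targets))
    return total / len(targets)


# Hypothetical predictions for a problem with K = 5 classes.
predicted = [1, 2, 5]
actual = [1, 3, 2]
risk_s1 = empirical_risk(predicted, actual, s=1)  # with s = 1 this is the error rate
risk_s4 = empirical_risk(predicted, actual, s=4)  # with s = K - 1 this is the mean absolute error
```

With $s = 1$ every mistake costs the same, while $s = K - 1$ leaves the absolute difference uncapped; intermediate values of $s$ interpolate between the two.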



3.1 Mapping the data replication method to SVMs 



3.1.1 The minimum margin principle 



Let us formulate the problem of separating $K$ ordered classes $C_1, \dots, C_K$ in the spirit of
SVMs.

Starting from the generalization of the two-class separating hyperplane presented in the
beginning of the previous section, let us look for $K - 1$ parallel hyperplanes represented by
vector $w \in \mathbb{R}^p$ and scalars $b_1, \dots, b_{K-1}$, such that the feature space is divided into equally
ranked regions by the decision boundaries $w^t x + b_r$, $r = 1, \dots, K - 1$.

Going for a strategy to maximize the margin of the closest pair of classes, the goal becomes
to maximize $\min |w^t x + b_i| / \|w\|$. Recalling that an algebraic measure of the distance of a
point to the hyperplane $w^t x + b$ is given by $(w^t x + b)/\|w\|$, we can scale $w$ and $b_i$ so that
the value of the minimum margin is $2/\|w\|$.

The constraints to consider result from the $K - 1$ binary classifications related to each
hyperplane; the number of classes involved in each binary classification can be made dependent
on a parameter $s$, as depicted in figure 3.1. For the hyperplane $q \in \{1, \dots, K - 1\}$, the
constraints result as

$$
\begin{aligned}
-(w^t x_i^{(k)} + b_q) &\geq +1, & k &= \max(1, q - s + 1), \dots, q \\
+(w^t x_i^{(k)} + b_q) &\geq +1, & k &= q + 1, \dots, \min(K, q + s)
\end{aligned}
$$

$\{C_1\} \mid \{C_2, C_3\}$

$\{C_1, C_2\} \mid \{C_3, C_4\}$

$\{C_2, C_3\} \mid \{C_4, C_5\}$

$\{C_3, C_4\} \mid \{C_5\}$

Figure 3.1: Classes involved in the hyperplanes constraints, for K = 5, s = 2.
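The grouping of classes on each side of a boundary follows directly from the index ranges $\max(1, q - s + 1), \dots, q$ and $q + 1, \dots, \min(K, q + s)$. The short Python sketch below enumerates them; the function name is a hypothetical helper introduced only for illustration.

```python
def classes_per_boundary(K, s):
    """For each hyperplane q = 1..K-1, list the classes on its two sides."""
    groups = []
    for q in range(1, K):
        below = list(range(max(1, q - s + 1), q + 1))   # classes mapped to the first binary class
        above = list(range(q + 1, min(K, q + s) + 1))   # classes mapped to the second binary class
        groups.append((below, above))
    return groups


groups_k5_s2 = classes_per_boundary(K=5, s=2)
```

For $K = 5$ and $s = 2$ this yields the four groupings of figure 3.1, and setting $s = K - 1$ makes every class take part in every boundary.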



Our model can now be summarized as: 

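$$
\begin{aligned}
\min_{w, b_i} \quad & \frac{1}{2} w^t w \\
\text{s.t.} \quad
& -(w^t x_i^{(k)} + b_1) \geq +1, && k = 1 \\
& +(w^t x_i^{(k)} + b_1) \geq +1, && k = 2, \dots, \min(K, 1 + s) \\
& \qquad\vdots \\
& -(w^t x_i^{(k)} + b_q) \geq +1, && k = \max(1, q - s + 1), \dots, q \\
& +(w^t x_i^{(k)} + b_q) \geq +1, && k = q + 1, \dots, \min(K, q + s) \\
& \qquad\vdots \\
& -(w^t x_i^{(k)} + b_{K-1}) \geq +1, && k = \max(1, K - s), \dots, K - 1 \\
& +(w^t x_i^{(k)} + b_{K-1}) \geq +1, && k = K
\end{aligned}
\tag{3.2}
$$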

Reasoning as in the two-class SVM for the non-linearly separable dataset, the model becomes 



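$$
\begin{aligned}
\min_{w, b_i, \xi_i} \quad & \frac{1}{2} w^t w + C \sum_{q=1}^{K-1} \; \sum_{k=\max(1, q-s+1)}^{\min(K, q+s)} \; \sum_{i=1}^{\ell_k} sgn\left(\xi_{i,q}^{(k)}\right) \\
\text{s.t.} \quad
& -(w^t x_i^{(k)} + b_1) \geq +1 - \xi_{i,1}^{(k)}, && k = 1 \\
& +(w^t x_i^{(k)} + b_1) \geq +1 - \xi_{i,1}^{(k)}, && k = 2, \dots, \min(K, 1 + s) \\
& \qquad\vdots \\
& -(w^t x_i^{(k)} + b_q) \geq +1 - \xi_{i,q}^{(k)}, && k = \max(1, q - s + 1), \dots, q \\
& +(w^t x_i^{(k)} + b_q) \geq +1 - \xi_{i,q}^{(k)}, && k = q + 1, \dots, \min(K, q + s) \\
& \qquad\vdots \\
& -(w^t x_i^{(k)} + b_{K-1}) \geq +1 - \xi_{i,K-1}^{(k)}, && k = \max(1, K - s), \dots, K - 1 \\
& +(w^t x_i^{(k)} + b_{K-1}) \geq +1 - \xi_{i,K-1}^{(k)}, && k = K \\
& \xi_{i,q}^{(k)} \geq 0
\end{aligned}
\tag{3.3}
$$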



Since each point $x_i^{(k)}$ is replicated $2s$ times, it is involved in the definition of $2s$ boundaries
(see figure 3.1); consequently, it can be shown to be misclassified
$\min(|f(x_i^{(k)}) - k|, s) = l^s(f(x_i^{(k)}), k)$ times, where $f(x_i^{(k)})$ is the class estimated by the model. As with the two-class
example, $\sum_{q=1}^{K-1} \sum_{k=\max(1, q-s+1)}^{\min(K, q+s)} \sum_{i=1}^{\ell_k} sgn(\xi_{i,q}^{(k)})$ is an upper bound of $\sum_k \sum_i l^s(f(x_i^{(k)}), k)$,
proportional to the empirical risk.†

Continuing the parallelism with the two-class SVM, the function to minimize simplifies to 



$$
\min_{w, b_i, \xi_i} \; \frac{1}{2} w^t w + C \sum_{q=1}^{K-1} \; \sum_{k=\max(1, q-s+1)}^{\min(K, q+s)} \; \sum_{i=1}^{\ell_k} \xi_{i,q}^{(k)}
\tag{3.4}
$$

subject to the same constraints as (3.3).

As easily seen, the proposed formulation resembles the fixed margin strategy in [4]. However,
instead of using only the two closest classes in the constraints of a hyperplane, more
appropriate for the loss function $l_{0-1}$, we adopt a formulation that captures better the
performance of a classifier for ordinal data.

Two issues were identified in the above formulation. First, this is an incompletely specified
model because the scalars $b_i$ are not well defined. In fact, although the direction of the
hyperplanes $w$ is unique under the above formulation (proceeding as [1] for the binary case),
the scalars $b_1, \dots, b_{K-1}$ are not uniquely defined, as illustrated in figure 3.2.

Figure 3.2: Scalar $b_2$ is undetermined over an interval under the fixed margin strategy.



Another issue is that, although the formulation was constructed from the two-class SVM, it 
is no longer solvable with the same algorithms. It would be interesting to accommodate this 
formulation under the two-class problem. Both issues are addressed by mapping the data 
replication method to SVMs. 

†Two parameters named $s$ have been introduced. In section 2 the $s$ parameter bounds the number of
classes involved in the definition of each boundary, controlling this way the growth of the original dataset.
The parameter $s$ introduced in equation (3.1) bounds the loss incurred in each example. Here we see that
they are the same parameter.

3.1.2 The oSVM algorithm



In order to get a better intuition of the general result, consider first the toy example 
previously presented. 



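The binary SVM formulation for the extended and binarized training set can be described
as (with $\bar{w} = \begin{bmatrix} w \\ w_3 \end{bmatrix}$, $w \in \mathbb{R}^2$)

$$
\begin{aligned}
\min_{\bar{w}, b} \quad & \frac{1}{2} \bar{w}^t \bar{w} \\
\text{s.t.} \quad
& -\left(\bar{w}^t \begin{bmatrix} x_i^{(1)} \\ 0 \end{bmatrix} + b\right) \geq +1 \\
& +\left(\bar{w}^t \begin{bmatrix} x_i^{(2)} \\ 0 \end{bmatrix} + b\right) \geq +1 \\
& +\left(\bar{w}^t \begin{bmatrix} x_i^{(3)} \\ 0 \end{bmatrix} + b\right) \geq +1 \\
& -\left(\bar{w}^t \begin{bmatrix} x_i^{(1)} \\ h \end{bmatrix} + b\right) \geq +1 \\
& -\left(\bar{w}^t \begin{bmatrix} x_i^{(2)} \\ h \end{bmatrix} + b\right) \geq +1 \\
& +\left(\bar{w}^t \begin{bmatrix} x_i^{(3)} \\ h \end{bmatrix} + b\right) \geq +1
\end{aligned}
\tag{3.5}
$$

But because $\bar{w}^t \begin{bmatrix} x_i \\ 0 \end{bmatrix} = w^t x_i$ and $\bar{w}^t \begin{bmatrix} x_i \\ h \end{bmatrix} = w^t x_i + w_3 h$,
and renaming $b$ to $b_1$ and $b + w_3 h$ to $b_2$, the formulation above simplifies to

$$
\begin{aligned}
\min_{w, b_1, b_2} \quad & \frac{1}{2} w^t w + \frac{1}{2} \frac{(b_2 - b_1)^2}{h^2} \\
\text{s.t.} \quad
& -(w^t x_i^{(1)} + b_1) \geq +1 \\
& +(w^t x_i^{(2)} + b_1) \geq +1 \\
& +(w^t x_i^{(3)} + b_1) \geq +1 \\
& -(w^t x_i^{(1)} + b_2) \geq +1 \\
& -(w^t x_i^{(2)} + b_2) \geq +1 \\
& +(w^t x_i^{(3)} + b_2) \geq +1
\end{aligned}
\tag{3.6}
$$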

Two points are worth mentioning: a) this formulation, being the result of a pure SVM
method, has a unique solution [1]; b) this formulation equals the formulation (3.4) for
ordinal data previously introduced, with $K = 3$, $s = K - 1 = 2$, and a slightly modified
objective function by the introduction of a regularization member, proportional to the
distance between the hyperplanes. The oSVM solution is the one that simultaneously
minimizes the distance between boundaries and maximizes the minimum of the margins
(figure 3.3). The $h$ parameter controls the tradeoff between the objectives of maximizing
the margin of separation and minimizing the distance between the hyperplanes.

To reiterate, the data replication method enabled us to formulate the classification of ordinal 
data as a standard SVM problem, removing the ambiguity in the solution by the introduction 
of a regularization term in the objective function. 
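The replication step itself is easy to express in code. The sketch below builds the extended, binarized training set from the class ranges used in the constraints above, under the assumption that $e_0$ is the zero vector of length $K - 2$ and that $e_{q-1}$ has the value $h$ in position $q - 1$ and zeros elsewhere. The function name, the label coding ($-1$/$+1$) and the toy data are illustrative assumptions.

```python
def replicate(X, y, K, s, h):
    """Return the extended binary dataset (X_ext, y_bin) for ordinal labels 1..K.

    For each boundary q, points of classes max(1, q-s+1)..min(K, q+s) are copied,
    extended with e_{q-1}, and labelled -1 (class <= q) or +1 (class > q).
    """
    X_ext, y_bin = [], []
    for q in range(1, K):
        extension = [0.0] * (K - 2)
        if q > 1:
            extension[q - 2] = h            # e_{q-1}; e_0 stays all zeros
        lowest, highest = max(1, q - s + 1), min(K, q + s)
        for features, label in zip(X, y):
            if lowest <= label <= highest:
                X_ext.append(list(features) + extension)
                y_bin.append(-1 if label <= q else +1)
    return X_ext, y_bin


# Hypothetical one-dimensional toy data with K = 3 classes.
X_toy = [[0.1], [0.5], [0.9]]
y_toy = [1, 2, 3]
X_ext, y_bin = replicate(X_toy, y_toy, K=3, s=2, h=1.0)
```

Any standard binary classifier can then be trained on `(X_ext, y_bin)`; a new point would be ranked by presenting its $K - 1$ replicas and counting how many are assigned to the positive class.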



(a) Original dataset in $\mathbb{R}$.

(b) Data set in $\mathbb{R}^2$, with samples replicated, $s = 2$, and oSVM solution to the binary
problem.

(c) oSVM solution in the original feature space.

Figure 3.3: Effect of the regularization member in the oSVM solution.

With the material on how to construct a set of optimal hyperplanes for the toy example, we
are now in a position to formally describe the construction of a support vector machine for
ordinal classification.

Consider a general extended dataset, as defined in (2.2). After the simplifications and change
of variables suggested in the toy example, the binary SVM formulation for this extended
dataset yields

$$
\min_{w, b_i, \xi_i} \; \frac{1}{2} w^t w + \frac{1}{h^2} \frac{1}{2} \sum_{i=2}^{K-1} (b_i - b_1)^2 + C \sum_{q=1}^{K-1} \; \sum_{k=\max(1, q-s+1)}^{\min(K, q+s)} \; \sum_{i=1}^{\ell_k} \xi_{i,q}^{(k)}
\tag{3.7}
$$

with the same set of constraints as (3.3).



This formulation for the high-dimensional dataset matches the proposed formulation for 
ordinal data up to an additional regularization member in the objective function. This 
additional member is responsible for the unique determination of the biases.* 

It is important to stress that the complexity of the SVM model does not depend on the 
dimensionality of the data. So, the only increase in the complexity of the problem is due to 
the duplication of the data (more generally, for a K-class problem, the dataset is increased 
at most (K — 1) times). As such, it compares favourably with the formulation in [27], which 
squares the dataset. 



* Different regularization members could be obtained by different extensions of the dataset. For example, if $e_q$
had been defined as the sequence $h, \dots, h, 0, \dots, 0$, with $q$ $h$'s and $(K - 2 - q)$ 0's, the regularization member
would be $\frac{1}{h^2} \frac{1}{2} \sum_{i=2}^{K-1} (b_i - b_{i-1})^2$.

Figure 3.4: oSVM interpretation of an ordinal multiclass problem as a two-class problem.



Nonlinear boundaries 

As explained before, the search for nonlinear level curves can be pursued in the extended
feature space by searching for a partially linear function $\bar{G}(\bar{x}) = G(x) + w^t e_i$. Since nonlinear
boundaries are handled in the SVM context making use of the well-known kernel trick, a
specified kernel $K(x_i, x_j)$ in the original feature space can be easily modified to
$\bar{K}(\bar{x}_i, \bar{x}_j) = K(x_i, x_j) + e_{x_i}^t e_{x_j}$ in the extended space.

Summarizing, the nonlinear ordinal problem can be solved by extending the feature set and
modifying the kernel function (figure 3.4). As we see, the extension to nonlinear decision
boundaries follows the same reasoning as with the standard SVM [1].
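The kernel modification can be sketched in a few lines of Python: the base kernel acts on the first $p$ (original) coordinates and a plain inner product is added for the replicated components. The polynomial base kernel and the function names below are illustrative assumptions.

```python
def poly_kernel(x, z, degree=2):
    """Inhomogeneous polynomial kernel (1 + x.z)^degree on the original features."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** degree


def extended_kernel(x_ext, z_ext, p, base_kernel=poly_kernel):
    """K_bar = K(x, z) + e_x . e_z, where the first p entries are the original features."""
    x, e_x = x_ext[:p], x_ext[p:]
    z, e_z = z_ext[:p], z_ext[p:]
    return base_kernel(x, z) + sum(a * b for a, b in zip(e_x, e_z))


# Two replicas in the extended space (p = 2 original features, K - 2 = 2 added ones).
same_boundary = extended_kernel([1.0, 2.0, 0.0, 3.0], [0.5, 1.0, 0.0, 3.0], p=2)
other_boundary = extended_kernel([1.0, 2.0, 0.0, 3.0], [0.5, 1.0, 3.0, 0.0], p=2)
```

Replicas attached to the same boundary receive an extra $h^2$ in their kernel value, whereas replicas attached to different boundaries fall back to the base kernel.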



Independent boundaries 

Considering now the setup for independent boundaries, as presented in (2.4), the linear,
binary SVM formulation yields

We see that if the regularization term $\frac{1}{h^2} \frac{1}{2} \sum_{i=2}^{K-1} (b_i - b_1)^2$ is zero (in practice, sufficiently small),
the optimization problem could be broken in $K - 1$ independent optimization problems,
reverting to the procedure of Frank and Hall [22].



3.2 Mapping the data replication method to NNs 



By letting $G(x)$ be the output of a neural network, a flexible architecture for ordinal data
can be devised as represented diagrammatically in figure 3.5. $G(x)$ is the output of a generic
feedforward network (in fact, it could be any neural network with a single output), which
is then linearly combined with the added $(K - 2)$ components.

For the simple case of searching for linear boundaries, the overall network simplifies to a
single neuron with $p + K - 2$ inputs. A less simplified model, also used in the conducted
experiments, is to consider a single hidden layer, as depicted in figure 3.6.

Interestingly, it is possible to provide a probabilistic interpretation to this neural network
model.

3.2.1 Ordinal logistic regression model

The traditional statistical approach for ordinal classification models the cumulative class
probability $P_k = p(C \leq k | x)$ by

$$
logit(P_k) = \Phi_k - G(x) \iff P_k = logsig(\Phi_k - G(x)), \quad k = 1, \dots, K - 1
\tag{3.9}
$$

Remember that $logit(y) = \ln \frac{y}{1 - y}$, $logsig(y) = \frac{1}{1 + e^{-y}}$ and $logsig(logit(y)) = y$.

For the linear version ([26, 28]) we take $G(x) = w^t x$. Mathieson [5] presents a nonlinear
version by letting $G(x)$ be the output of a neural network. However, other setups can be
devised. Start by observing that in (3.9) we can always assume $\Phi_1 = 0$ by incorporating
an appropriate additive constant in $G(x)$. We are left with the estimation of $G(x)$ and
$(K - 2)$ cut points. By fixing $f_N() = logsig()$ as the activation function in the output layer
of our oNN network, we can train the network to predict the values $P_k(x)$, when fed with
$\bar{x} = \begin{bmatrix} x \\ e_{k-1} \end{bmatrix}$, $k = 1, \dots, K - 1$. By setting $\bar{C}_1 = 1$ and $\bar{C}_2 = 0$ we see that the extended
dataset as defined in (2.2) can be used to train the oNN network. The predicted cut points
are simply the weights of the connection of the added $K - 2$ components, scaled by $h$.
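To make the cumulative model (3.9) concrete, the following Python sketch turns a value of $G(x)$ and a set of cut points $\Phi_1 \leq \dots \leq \Phi_{K-1}$ into the $K$ individual class probabilities by differencing the cumulative ones. The function names and the numeric cut points are assumptions chosen only for illustration.

```python
import math


def logsig(v):
    """Logistic sigmoid, the inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-v))


def class_probabilities(g, cut_points):
    """Class probabilities from P_k = logsig(Phi_k - G(x)), k = 1..K-1.

    cut_points must be sorted in increasing order; P_K = 1 is appended so that
    successive differences give P(C_1|x), ..., P(C_K|x).
    """
    cumulative = [logsig(phi - g) for phi in cut_points] + [1.0]
    probabilities, previous = [], 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical three-class example with Phi_1 = 0 (as assumed in the text) and Phi_2 = 1.
probs = class_probabilities(g=0.0, cut_points=[0.0, 1.0])
```

Because the cut points are increasing, the cumulative probabilities are non-decreasing and every difference is a valid probability.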

Illustrating this model with the synthetic dataset from Mathieson [5], we attained the decision
boundaries depicted in figure 3.7.







Figure 3.7: Decision boundaries for the oNN with 3 units in the hidden layer, for a synthetic
dataset from Mathieson [5]. $C_1$ = o, $C_2$ = , $C_3$ = <, $C_4$ = *



Chapter 4 

The unimodal method for NNs 



Given a new query point $x$, Bayes decision theory suggests classifying $x$ in the class which
maximizes the a posteriori probability $P(C_k|x)$. To do so, one usually has to estimate these
probabilities, either implicitly or explicitly. Suppose for instance that we have 7 classes and,
for a given point $x_0$, the highest probability is $P(C_5|x_0)$; we then assign class $C_5$ to the given
point. If there is no order relation between the classes, it is perfectly natural that the
second highest a posteriori probability is, for instance, $P(C_2|x_0)$. However, if the classes are
ordered, $C_1 < C_2 < \dots < C_7$, classes $C_4$ and $C_6$ are closer to class $C_5$ and therefore the
second and third highest a posteriori probabilities should be attained in these classes. This
argument extends easily to the classes $C_3$ and $C_7$, and so on. This is the main idea behind
the method proposed here, which is now detailed.

§The text and idea presented in this section are due to Prof. Joaquim F. Pinto da Costa. Some portions
of this chapter appeared in [24].

Our method assumes that in a supervised classification problem with ordered classes, the
random variable class associated with a given query $x$ should be unimodal. That is to say
that if we plot the a posteriori probabilities $P(C_k|x)$, from the first $C_1$ to the last $C_K$, there
should be only one mode in this graphic. Here, we apply this idea in the context of neural
networks. Usually in neural networks, the output layer has as many units as there are classes,
$K$. We will use the same order for these units and the classes. In order to force the output
values (which represent the a posteriori probabilities) to have just one mode, we will use a
parametric model for these output units. This model consists in assuming that the output
values come from a binomial distribution, $B(K - 1, p)$. This distribution is unimodal in
most cases and when it has two modes, these are for contiguous values, which makes sense
in our case, since we can have exactly the same probability for two classes. This binomial
distribution takes integer values in the set $\{0, 1, \dots, K - 1\}$; value 0 corresponds to class
$C_1$, value 1 to class $C_2$ and so on until value $K - 1$ to class $C_K$. As $K$ is known, the only
parameter left to be estimated from this model is the probability $p$. We will therefore use
a different architecture for the neural network; that is, the output layer will have just one
output unit, corresponding to the value of $p$ (figure 4.1). For a given query $x$, the output of
the network will be a single numerical value in the range [0,1], which we call $p_x$. Then, the
probabilities $P(C_k|x)$ are calculated from the binomial model:

$$
P(C_k|x) = \frac{(K-1)!}{(k-1)!\,(K-k)!} \, p_x^{k-1} (1 - p_x)^{K-k}, \quad k = 1, 2, \dots, K
$$

In fact these probabilities can be calculated recursively, to save computing time:

$$
\frac{P(C_k|x)}{P(C_{k-1}|x)} = \frac{p_x (K - k + 1)}{(k - 1)(1 - p_x)}
$$

and so

$$
P(C_k|x) = P(C_{k-1}|x) \, \frac{p_x (K - k + 1)}{(k - 1)(1 - p_x)}.
$$

We start with $P(C_1|x) = (1 - p_x)^{K-1}$ and compute the other probabilities, $P(C_k|x)$, $k =
2, 3, \dots, K$, using the above formula.
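The recursion can be sketched as follows in Python. The function name is hypothetical, and the special case $p_x = 1$ (where the recursion would divide by zero) is handled separately as an added safeguard that the text does not discuss.

```python
def binomial_class_probabilities(p_x, K):
    """P(C_1|x), ..., P(C_K|x) under the B(K-1, p_x) model, computed recursively."""
    if not 0.0 <= p_x <= 1.0:
        raise ValueError("p_x must lie in [0, 1]")
    if p_x == 1.0:
        # Degenerate case: all the mass is on the last class.
        return [0.0] * (K - 1) + [1.0]
    probabilities = [(1.0 - p_x) ** (K - 1)]          # P(C_1|x)
    for k in range(2, K + 1):
        ratio = p_x * (K - k + 1) / ((k - 1) * (1.0 - p_x))
        probabilities.append(probabilities[-1] * ratio)
    return probabilities


probs_k3 = binomial_class_probabilities(0.5, K=3)
predicted_class = 1 + max(range(3), key=lambda index: probs_k3[index])
```

Each step costs a single multiplication and division, so all $K$ probabilities are obtained without evaluating any factorial.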

When the training case $x$ is presented, the error is defined as

$$
error = \sum_{k=1}^{K} |P(C_k|x) - \delta(k - C_x)|^2
\tag{4.1}
$$

where $\delta(n) = 1$ if $n = 0$ and $\delta(n) = 0$ otherwise, and $C_x$ is the true class of $x$. The network is trained to minimize
the average value of this error over all training cases. Finally, in the test phase, we choose
the class $k$ which maximizes the probability $P(C_k|x)$. This is the mode of the binomial
distribution, $1 + \lfloor K p_x \rfloor$, which often coincides with the rounding of $1 + (K - 1) p_x$ to the
nearest integer, where $p_x$ is the network output.

4.1 Connection with regression models 

Consider the following equivalences:

$$
\min_{p_x} \sum_{k=1}^{K} |P(C_k|x) - \delta(k - C_x)|
\iff \min_{p_x} \; 1 - P(C_x|x) + \sum_{k \neq C_x} P(C_k|x)
\iff \min_{p_x} \; 2 - 2 P(C_x|x)
\iff \max_{p_x} P(C_x|x)
$$

Let $p_x^{opt}$ be the parameter that maximizes $P(C_x|x)$. For the binomial case $p_x^{opt} = \frac{C_x - 1}{K - 1}$. Then

$$\max_{p_x} P(C_x|x) \tag{4.2}$$

and

$$\min_{p_x} |p_x - p_x^{opt}| \tag{4.3}$$

both attain the global optimal value at the same p x value. That is to say that the training 
of the network could be also performed by minimizing the error of the network output to the 
optimal parameter: a simple case of regression of the parameter of the binomial distribution. 
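As a quick numerical check of this claim, the sketch below compares a grid search for the maximizer of $P(C_x|x)$ with the closed form $(C_x - 1)/(K - 1)$, which is what setting the derivative of $p^{C_x - 1}(1 - p)^{K - C_x}$ to zero gives for the binomial model. The function names and the grid resolution are illustrative assumptions.

```python
from math import comb


def binomial_prob(k, p, K):
    """P(C_k|x) for the B(K-1, p) model, with classes numbered 1..K."""
    return comb(K - 1, k - 1) * p ** (k - 1) * (1.0 - p) ** (K - k)


def optimal_parameter(true_class, K):
    """Regression target for p_x: the value maximizing P(C_x|x)."""
    return (true_class - 1) / (K - 1)


def grid_argmax(true_class, K, steps=1000):
    """Brute-force maximizer of P(C_x|x) over a regular grid on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: binomial_prob(true_class, p, K))


target = optimal_parameter(true_class=2, K=5)
found = grid_argmax(true_class=2, K=5)
```

The regression targets are therefore equally spaced in $[0, 1]$, with the first class mapped to 0 and the last to 1.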

Note that the two approaches are not mathematically equivalent. Although they share the
same global optimum, the error surface is different and it is natural that practical optimization
algorithms stop at different values, maybe trapped at some local optimum. Another
way of looking at the problem is to say that both are a regression of the parameter $p_x$, using
different error measures. The advantage of minimizing directly $|p_x - p_x^{opt}|$, or the
squared version of it, is that it fits directly in existing software packages. However, both
impose a unimodal distribution of the output probabilities.

As the above formulation suggests, the adjustment of any probability distribution dependent
on a single parameter reduces to a regression of that parameter against its optimal value.
This approach is then part of a larger set of techniques to estimate by regression any ordered
scores $s_1 < \dots < s_K$; the simplest case would be the set of integers $1, \dots, K$ [5, 29, 30].

Using a neural network with not one but two outputs, it is natural to extend the former 
reasoning to unimodal distribution with two parameters, as a greater flexibility should bring 
a better fitting to the data. The training could be performed directly with some of the 
regression errors discussed above and the test phase would be just the selection of the mode 
class dictated by the network output. 



Chapter 5 



Experimental Results 



Next we present experimental results for several models based on SVMs and NNs, when 
applied to several datasets, ranging from synthetic datasets, real ordinal data, to quantized 
data from regression problems. 

5.1 SVM based algorithms 

We compare the following algorithms: 

• A conventional multiclass SVM formulation (cSVM), based on the one-against-one
decomposition. The one-against-one decomposition transforms the multiclass problem
into a series of $K(K - 1)/2$ binary subtasks that can be trained by a binary SVM.
Classification is carried out by a voting scheme.

• Pairwise SVM (pSVM): Frank and Hall [22] introduced a simple algorithm that enables
standard classification algorithms to exploit the ordering information in ordinal
prediction problems. First, the data is transformed from a $K$-class ordinal problem
to $K - 1$ binary class problems. To predict the class value of an unseen instance the
probabilities of the $K$ original classes are estimated using the outputs from the $K - 1$
binary classifiers.

• Herbrich [27] model (hSVM), based on the correspondence of the ordinal regression
task and the task of learning a preference relation on pairs of objects. A loss function
was defined on pairs of objects and the classification task formulated in this space.
The size of the new training set, derived from an $\ell$-sized training set, can be as high
as $\ell^2$. Only the direction $w$ was computed directly from this model. Scalars $b_i$ were
obtained in a second step, performing a 1-dimensional SVM. Due to limitations of the
implementation of this method and its excessively long training time, some results are
not available (NA).

• Proposed ordinal method (oSVM), based on the data extension technique, as previously
introduced.
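For the pairwise approach, the way class probabilities are recovered from the $K - 1$ binary outputs can be sketched as below. It assumes that the $q$-th binary classifier estimates $P(\text{class} > q \,|\, x)$, which is the combination rule usually attributed to Frank and Hall [22]; the function name and the numeric outputs are hypothetical.

```python
def frank_hall_probabilities(p_greater):
    """Combine K-1 estimates of P(class > q | x), q = 1..K-1, into K class probabilities."""
    K = len(p_greater) + 1
    probabilities = [1.0 - p_greater[0]]                      # P(C_1) = 1 - P(class > 1)
    for k in range(2, K):
        probabilities.append(p_greater[k - 2] - p_greater[k - 1])
    probabilities.append(p_greater[-1])                       # P(C_K) = P(class > K-1)
    return probabilities


# Hypothetical outputs of three binary classifiers for a 4-class problem.
combined = frank_hall_probabilities([0.9, 0.4, 0.1])
predicted = 1 + max(range(len(combined)), key=lambda index: combined[index])
```

Since the binary classifiers are trained independently, their outputs need not be monotone in $q$, in which case some of the differences can turn out negative.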

Experiments were carried out in Matlab 7.0 (R14), using the Support Vector Machine
toolbox, version 2.51, by Anton Schwaighofer. This toolbox was used to construct the oSVM
classifier, the Herbrich [27] model and the pairwise SVM. The STPRtool, version 2.01, was
also used for the implementation of the generic multiclass SVM. The $C$ and $h$ parameters
were experimentally tuned for the best performance.

5.2 Neural network based algorithms 

We compare the following algorithms: 

• Conventional neural network (cNN). To test the hypothesis that methods specifically
targeted for ordinal data improve the performance of a standard classifier, we tested a
conventional feedforward network, fully connected, with a single hidden layer, trained
with the traditional least square approach and with the special activation function
softmax. For each case study, the result presented is the best of the two configurations.

• Pairwise NN (pNN): mapping in neural networks the strategy of [22] mentioned above
for pSVM.

• Costa [31], following a probabilistic approach, proposes a neural network architecture
(iNN) that exploits the ordinal nature of the data, by defining the classification task on
a suitable space through a "partitive approach". A feedforward neural network with
$K - 1$ outputs is proposed to solve a $K$-class ordinal problem. The probabilistic
meaning assigned to the network outputs is exploited to rank the elements of the
dataset.

• Proposed unimodal model (uNN). Several variants of the unimodal model were gauged,
ranging from one-parameter distributions, such as the binomial and the Poisson, to
two-parameter distributions, such as the hypergeometric and the Gaussian distribution.
Other ideas such as modifying a conventional neural network to penalise multimodal
outputs were also considered. However, models whose optimization did not fit directly
under a standard implementation of the backpropagation algorithm were optimized
with generic optimization functions available in Matlab. Presumably due to that
fact, the best results were obtained when performing direct regression of the binomial
parameter, as in (4.3), for which we present the results.

• Proposed ordinal method (oNN), based on the data extension technique, as previously
introduced.

Experiments were carried out in Matlab 7.0 (R14), making use of the Neural Network
Toolbox. All models were configured with a single hidden layer and trained with the
Levenberg-Marquardt backpropagation method, over 2000 epochs.

The number of neurons in the hidden layer was experimentally tuned for the best performance.

5.3 Measuring classifier performance 

Having built a classifier, the obvious question is "how good is it?". This raises the question
of what we mean by good. The obvious answer is to treat every misclassification as equally
costly, adopting the misclassification error rate (MER) criterion to measure the performance
of the classifier. However, for ordered classes, losses that increase with the absolute difference
between the class numbers are more natural choices in the absence of better information [5].

The mean absolute error (MAE) criterion takes into account the degree of misclassification
and is thus a richer criterion than MER. The loss function corresponding to this criterion is
$l(f(x), y) = |f(x) - y|$.

A variant of the above MAE measure is the mean square error (MSE), where the absolute
difference is replaced with the square of the difference, $l(f(x), y) = (f(x) - y)^2$.
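The three criteria differ only in the per-example loss that is averaged, as the following Python sketch shows. The function names and the sample predictions are illustrative assumptions.

```python
def mean_loss(predictions, targets, loss):
    """Average a per-example loss over aligned lists of predicted and true classes."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("predictions and targets must be non-empty and aligned")
    return sum(loss(p, t) for p, t in zip(predictions, targets)) / len(targets)


def mer(predictions, targets):
    """Misclassification error rate: 0-1 loss."""
    return mean_loss(predictions, targets, lambda p, t: 1.0 if p != t else 0.0)


def mae(predictions, targets):
    """Mean absolute error: |f(x) - y|."""
    return mean_loss(predictions, targets, lambda p, t: abs(p - t))


def mse(predictions, targets):
    """Mean square error: (f(x) - y)^2."""
    return mean_loss(predictions, targets, lambda p, t: (p - t) ** 2)


# Hypothetical predictions on four examples of a 5-class problem.
predicted = [1, 2, 5, 3]
actual = [1, 3, 2, 3]
scores = (mer(predicted, actual), mae(predicted, actual), mse(predicted, actual))
```

In this example half of the predictions are wrong, but the single error of three ranks dominates the MAE and, even more, the MSE.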

Finally, the performance of the classifiers was also assessed with the Spearman ($r_s$) and
Kendall's tau-b ($\tau$) coefficients, nonparametric rank-order correlation coefficients well
established in the literature [32]. A proposal for yet another coefficient, $o_c$, was also implemented.¶
To define $o_c$, we start with the $N$ data points $(x_i, y_i)$ and consider all $\frac{1}{2} N(N - 1)$ pairs of
data points. Following the notation in [32], we call a pair concordant if the relative ordering
of the ranks of the two x's is the same as the relative ordering of the ranks of the two y's.
We call a pair discordant if the relative ordering of the ranks of the two x's is opposite to the
relative ordering of the ranks of the two y's. If there is a tie in either the ranks of the two
x's or the ranks of the two y's, then we do not call the pair either concordant or discordant.
If the tie is in the x's, we will call the pair an "extra x pair", $e_x$. If the tie is in the y's, we
will call the pair an "extra y pair", $e_y$. If the tie is both on the x's and the y's, we ignore
the pair.

Inspired by the work of Lerman [33, 34], a simplified coefficient was conceived from a
set-theoretic representation of the two variables to be compared. After a straightforward
mathematical manipulation, the $o_c$ coefficient can be computed as

$$
o_c = -1 + 2 \, \frac{concordant}{\sqrt{concordant + discordant + e_x} \; \sqrt{concordant + discordant + e_y}}
$$

where the scale factor and bias are used just to set the coefficient between $-1$ and $1$.
This expression shows a striking resemblance with the formula for Kendall's $\tau$:

$$
\tau = \frac{concordant - discordant}{\sqrt{concordant + discordant + e_x} \; \sqrt{concordant + discordant + e_y}}
$$

¶The idea for this coefficient is thanks to Prof. Joaquim F. Pinto da Costa.
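Both coefficients can be computed from the same four pair counts. The Python sketch below follows the definitions given in the text for concordant, discordant and extra pairs; the function names are hypothetical, and returning 0 when the denominator vanishes is an added assumption for degenerate inputs.

```python
from math import sqrt


def pair_counts(x, y):
    """Count concordant, discordant, extra-x and extra-y pairs over all N(N-1)/2 pairs."""
    concordant = discordant = extra_x = extra_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue                    # tie in both variables: pair is ignored
            if dx == 0:
                extra_x += 1                # tie only in the x's
            elif dy == 0:
                extra_y += 1                # tie only in the y's
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    return concordant, discordant, extra_x, extra_y


def oc_and_tau(x, y):
    """Return (o_c, Kendall's tau-b) for two equally long sequences of ranks."""
    concordant, discordant, extra_x, extra_y = pair_counts(x, y)
    denominator = sqrt(concordant + discordant + extra_x) * sqrt(concordant + discordant + extra_y)
    if denominator == 0:
        return 0.0, 0.0                     # assumption: no informative pairs at all
    o_c = -1.0 + 2.0 * concordant / denominator
    tau = (concordant - discordant) / denominator
    return o_c, tau


perfect = oc_and_tau([1, 2, 3, 4], [1, 2, 3, 4])
with_ties = oc_and_tau([1, 1, 2, 3], [1, 2, 2, 3])
```

When there are no discordant pairs the two coefficients still differ in the presence of ties, because $o_c$ rescales the fraction of concordant pairs while $\tau$ uses it directly.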



Chapter 6 

Results for a synthetic dataset 



6.1 Results for neural networks methods 



In a first comparative study we generated a synthetic dataset in a similar way to Herbrich 
[27]. 

We generated 1000 example points $x = [x_1 \; x_2]^t$ uniformly at random in the unit square
$[0, 1] \times [0, 1] \subset \mathbb{R}^2$. Each point was assigned a rank $y$ from the set $\{1, 2, 3, 4, 5\}$, according to

$$
y = \min_{r \in \{1,2,3,4,5\}} \{ r : b_{r-1} < 10 (x_1 - 0.5)(x_2 - 0.5) + \varepsilon < b_r \}
$$

$$
(b_0, b_1, b_2, b_3, b_4, b_5) = (-\infty, -1, -0.1, 0.25, 1, +\infty)
$$

where $\varepsilon$ is a random value, normally distributed with zero mean and standard deviation
$\sigma = 0.125$. Figure 6.1(a) shows the five regions and figure 6.1(b) the points which were
assigned to a different rank after the corruption with the normally distributed noise.
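The generation rule can be sketched in Python as follows. The thresholds and the noise level are those stated above; the function names, the use of Python's `random` module and the seed are illustrative assumptions (the original experiments were run in Matlab).

```python
import random

# Finite thresholds b_1..b_4 for the 5-class problem; b_0 = -inf and b_5 = +inf are implicit.
THRESHOLDS_5 = [-1.0, -0.1, 0.25, 1.0]


def rank_of(value, thresholds):
    """Smallest r such that b_{r-1} < value < b_r, with ranks starting at 1."""
    for r, upper in enumerate(thresholds, start=1):
        if value < upper:
            return r
    return len(thresholds) + 1


def generate(n, thresholds, sigma, seed=0):
    """Draw n points uniformly in the unit square and rank them with Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        noisy_score = 10.0 * (x1 - 0.5) * (x2 - 0.5) + rng.gauss(0.0, sigma)
        data.append(((x1, x2), rank_of(noisy_score, thresholds)))
    return data


dataset = generate(1000, THRESHOLDS_5, sigma=0.125)
```

The same helper covers the 10-class variant by passing the longer list of thresholds and the corresponding standard deviation.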

In order to compare the different algorithms, and similarly to [27], we randomly selected
training sequences of point-rank pairs of length $\ell$ ranging from 20 to 100. The remaining
points were used to estimate the classification error, which was averaged over 100 runs of
the algorithms for each size of the training sequence. Thus we obtained the learning curves
shown in figure 6.2, for 5 neurons in the hidden layer.



6.1.1 Accuracy dependence on the number of classes 

To investigate the relation between the number of classes and the performance of the 
evaluated algorithms, we also ran all models on the same dataset but with 10 classes. 
(a) Classes' boundaries.

(b) Scatter plot of the data points wrongly ranked. Number of wrong points: 14.2%.

(c) Class distribution.

Figure 6.1: Test setup for 5 classes in $\mathbb{R}^2$.

(a) MER criterion. (b) MAE criterion. (c) MSE criterion.

(d) Spearman coefficient. (e) Kendall's tau-b criterion. (f) $o_c$ criterion.

Figure 6.2: NN results for 5 classes in $\mathbb{R}^2$, with 5 hidden units.
(a) Classes' boundaries.

(b) Scatter plot of the data points wrongly ranked. Number of wrong points: 13.9%.

(c) Class distribution.

Figure 6.3: Test setup for 10 classes in $\mathbb{R}^2$.



This time each point was assigned a rank $y$ from the set $\{1, 2, 3, 4, 5, 6, 7, 8, 9, 10\}$, according to

$$
y = \min_{r \in \{1,2,3,4,5,6,7,8,9,10\}} \{ r : b_{r-1} < 10 (x_1 - 0.5)(x_2 - 0.5) + \varepsilon < b_r \}
$$

$$
(b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8, b_9, b_{10}) = (-\infty, -1.75, -1, -0.5, -0.1, 0.1, 0.25, 0.75, 1, 1.75, +\infty)
$$

where $\varepsilon$ is a random value, normally distributed with zero mean and standard deviation
$\sigma = 0.125/2$. Figure 6.3(a) shows the ten regions and figure 6.3(b) the points which were
assigned to a different rank after the corruption with the normally distributed noise. The
learning curves obtained for this arrangement are shown in figure 6.4 (again, for 5 neurons
in the hidden layer).



6.1.2 Accuracy dependence on the data dimension 

The described experiments in $\mathbb{R}^2$ were repeated for data points in $\mathbb{R}^4$, to evaluate the influence
of data dimension on models' relative performance.

We generated 2000 example points $x = [x_1 \; x_2 \; x_3 \; x_4]^t$ uniformly at random in the unit
hypercube in $\mathbb{R}^4$.

For 5 classes, each point was assigned a rank $y$ from the set $\{1, 2, 3, 4, 5\}$, according to

$$
y = \min_{r \in \{1,2,3,4,5\}} \left\{ r : b_{r-1} < 1000 \prod_{i=1}^{4} (x_i - 0.5) + \varepsilon < b_r \right\}
$$

$$
(b_0, b_1, b_2, b_3, b_4, b_5) = (-\infty, -2.5, -0.5, 0.5, 3, +\infty)
$$

where $\varepsilon$ is a random value, normally distributed with zero mean and standard deviation
$\sigma = 0.25$.

(d) Spearman coefficient. (e) Kendall's tau-b coefficient. (f) $o_c$ coefficient.

Figure 6.4: NN results for 10 classes in $\mathbb{R}^2$, with 5 hidden units.

(a) K = 5. (b) K = 10.

Figure 6.5: Class distribution in $\mathbb{R}^4$.

Model                       cNN    pNN       iNN    uNN    oNN
$\mathbb{R}^2$, K = 5        45    21 x 4     39     21     23
$\mathbb{R}^2$, K = 10       75    21 x 9     69     21     28
$\mathbb{R}^4$, K = 5       165    97 x 4    148     97    100
$\mathbb{R}^4$, K = 10      250    97 x 9    233     97    105

Table 6.1: Number of parameters for each neural network model.

Finally, for 10 classes the rank was assigned according to the rule

$$
y = \min_{r \in \{1,2,3,4,5,6,7,8,9,10\}} \left\{ r : b_{r-1} < 1000 \prod_{i=1}^{4} (x_i - 0.5) + \varepsilon < b_r \right\}
$$

$$
(b_0, b_1, b_2, b_3, b_4, b_5, b_6, b_7, b_8, b_9, b_{10}) = (-\infty, -5, -2.5, -1, -0.4, 0.1, 0.5, 1.1, 3, 6, +\infty)
$$

where $\varepsilon$ is a random value, normally distributed with zero mean and standard deviation
$\sigma = 0.125$. Class distributions are presented in figure 6.5; the learning curves are shown in
figures 6.6 and 6.7, for 16 neurons in the hidden layer.



6.1.3 Network complexity 



One final point to make in any comparison of methods regards complexity. The number of
learnable parameters for each model is presented in table 6.1.

(d) Spearman coefficient. (e) Kendall's tau-b coefficient. (f) $o_c$ coefficient.

Figure 6.6: NN results for 5 classes in $\mathbb{R}^4$, with 16 hidden units.

(d) Spearman coefficient. (e) Kendall's tau-b coefficient. (f) $o_c$ coefficient.

Figure 6.7: NN results for 10 classes in $\mathbb{R}^4$, with 16 hidden units.
(a) 5 classes in $\mathbb{R}^2$. $C = 10000$, $h = 10$, $K(x, y) = (1 + x^t y)^2$.

(b) 10 classes in $\mathbb{R}^2$. $C = 10000$, $h = 10$, $K(x, y) = (1 + x^t y)^2$.

(c) 5 classes in $\mathbb{R}^4$. $C = 10000$, $h = 10$, $K(x, y) = (1 + x^t y)^4$.

(d) 10 classes in $\mathbb{R}^4$. $C = 10000$, $h = 10$, $K(x, y) = (1 + x^t y)^4$.

Figure 6.8: SVM results - MER criterion.



6.2 Results for SVM methods 



Because the comparative study for the SVM based methods followed the same reasoning as 
for the neural network methods, we present here only the results attained, in figures 
6.8(a), 6.8(b), 6.8(c) and 6.8(d). Because all classification indices portrayed essentially the 
same relative performance, and to facilitate the comparison with results previously reported 
in the literature, we will restrict ourselves, here and in what follows, to the MER criterion. 
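
For reference, the error-based indices can be computed as in the sketch below. It assumes that MER denotes the misclassification error rate and that classes are coded as consecutive integers; the function names are hypothetical.

```python
import math

def mer(y_true, y_pred):
    """Misclassification error rate: fraction of wrongly predicted labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error between integer-coded ranks."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error between integer-coded ranks."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

Unlike MER, the last two measures penalise a prediction more heavily the further it falls from the true rank, which is why they are often reported alongside MER for ordinal problems.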



6.3 Discussion 



A first comment relates to the unfairness of comparing SVM to NN based methods, since 
the kernel parameters were illegally tuned to the datasets. The main assertion concerns 
the superiority of all algorithms specific to ordinal data over conventional methods, both for 
SVMs and NNs; the proposed method, in spite of being the simplest model, performs as 
well as or better than the other models under comparison. 



Chapter 7 



Results for practical ordinal 
datasets 

The next sections present results for datasets with real data. 

7.1 Pasture production 

The next experiment is based on a publicly available dataset with real-life data, available 
at the WEKA website*. The objective was to predict pasture production from a variety 
of biophysical factors. Vegetation and soil variables from areas of grazed North Island 
hill country with different management (fertilizer application/stocking rate) histories 
(1973-1994) were measured and subdivided into 36 paddocks. Nineteen vegetation (including 
herbage production); soil chemical, physical and biological; and soil water variables were 
selected as potentially useful biophysical indicators - table 7.1. The target feature, the 
pasture production, has been categorized in three classes (Low, Medium, High), evenly 
distributed in the dataset of 36 instances. 

The results attained are summarized in table 7.2. Before training, the data was scaled to 
fall always within the range [0,1], using the transformation 
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. The fertiliser attribute was represented 
using 4 variables: LL = (1, 0, 0, 0), LN = (0, 1, 0, 0), HN = (0, 0, 1, 0) and 
HH = (0, 0, 0, 1). 
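
The two preprocessing steps just described can be sketched as follows. The helper names are hypothetical, the level names follow the list given in table 7.1, and the handling of a constant column is an added assumption.

```python
def min_max_scale(column):
    """Map values linearly onto [0, 1]: x' = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # assumption: a constant column is mapped to all zeros
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Level names as listed in table 7.1.
FERTILISER_LEVELS = ["LL", "LN", "HN", "HH"]

def one_hot(level, levels=FERTILISER_LEVELS):
    """Encode a nominal level as one indicator variable per level."""
    return [1 if level == name else 0 for name in levels]
```

Using one indicator per level avoids imposing any order on the fertiliser attribute, which is the motivation given in the text for treating it differently from the remaining attributes.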

The lack of motivation to impose an ordered relation on the fertiliser attribute suggests a 
good scenario in which to apply the general version of the data replication method, where only 21 
attributes (j = 21) are constrained to have the same direction, with the fertiliser attribute left 

* http://www.cs.waikato.ac.nz/ml/weka/ 

The information is a replica of the notes made available with the data. 
Name                 Data Type                     Description 
fertiliser           enumerated (LL, LN, HN, HH)   fertiliser used 
slope                integer                       slope of the paddock 
aspect-dev-NW        integer                       the deviation from the north-west 
OlsenP               integer 
MinN                 integer 
TS                   integer 
Ca-Mg                real                          calcium magnesium ratio 
LOM                  real                          soil LOM (g/100g) 
NFIX-mean            real                          a mean calculation 
Eworms-main-3        real                          main 3 spp earth worms per g/m2 
Eworms-No-species    integer                       number of spp 
KUnSat               real                          mm/hr 
OM                   real 
Air-Perm             real 
Porosity             real 
HFRG-pct-mean        real                          mean percent 
legume-yield         real                          kgDM/ha 
OSPP-pct-mean        real                          mean percent 
Jan-Mar-mean-TDR     real 
Annual-Mean-Runoff   real                          mm 
root-surface-area    real                          m2/m3 
Leaf-P               real                          ppm 

Table 7.1: Characteristics of the 22 features of the Pasture dataset. 



kernel                     cSVM            pSVM           hSVM            oSVM 
K(x, y) = x^t y            27.8 (C=0.2)    27.8 (C=1.0)   27.8 (C=0.01)   27.8 (C=0.5) 
K(x, y) = (1 + x^t y)^2    25.0 (C=0.04)   25.0 (C=0.2)   25.0 (C=0.01)   22.2 (C=0.02) 

(a) SVMs' results, h = 100, s = 1, leave-one-out. 

hidden units   cNN    pNN    iNN    uNN    oNN 
0              35.6   48.1   34.2   56.7   55.0 
4              36.1   37.5   33.6   35.3   38.3 

(b) NNs' results, h = 1, s = 2, leave-one-out. 

Table 7.2: MER (%) for the Pasture dataset. 

free. Using a linear kernel with C = 0.5 (h = 100, s = 1), a classifier emerges with an expected 
MER of 22.2%. In this way, a very simple classifier was obtained at the best performance. 

7.2 Employee selection: the ESL dataset 

The next experiment is also based on a publicly available dataset from the WEKA website. 
The ESL dataset contains 488 profiles of applicants for certain industrial jobs. Expert 
psychologists of a recruiting company, based upon psychometric test results and interviews 
with the candidates, determined the values of the input attributes (4 attributes, with integer 
values from 0 to 9). The output is an overall score (1..9) corresponding to the degree of 
fitness of the candidate to this type of job, distributed according to figure 7.1. 



Figure 7.1: Class distribution of 488 examples, for the ESL dataset (horizontal axis: overall score, 1 to 9). 



The comparative study of the learning algorithms followed the same reasoning as for the 
synthetic datasets; therefore we present here only the results attained for the MER 
criterion - figures 7.2(a) and 7.2(b). 



In the pasture dataset conventional methods performed as well as ordinal methods, while 
algorithms based on SVMs clearly outperformed NN based algorithms - an expected result 
given the limited number of examples in the dataset. On the other hand, for the 
ESL dataset, there is no discernible difference between SVM and NN based algorithms, but 
conventional methods are clearly behind specific methods for ordinal data. 



7.3 Aesthetic evaluation of breast cancer conservative treatment § 

In this section we illustrate the application of the learning algorithms to the prediction of 
the cosmetic result of breast cancer conservative treatment. 

Breast cancer conservative treatment (BCCT) has been increasingly used over the last few 
years, as a consequence of its much more acceptable cosmetic outcome than traditional 
techniques, but with identical oncological results. Although considerable research has been 
put into BCCT techniques, diverse aesthetic results are common, highlighting the importance 
of this evaluation in institutions performing breast cancer treatment, so as to improve 
working practices. 

Traditionally, aesthetic evaluation has been performed subjectively by one or more observers 
§ Some portions of this section appeared in [6,23]. 
Figure 7.2: Results for the ESL dataset, MER criterion, against training set size. (a) SVM results: K(x, y) = x^t y, C = 100, h = 0.5. (b) NN results: h = 0.5, no hidden layer. 



[35-37]. However, this form of assessment has been shown to be poorly reproducible [38- 
41], which creates uncertainty when comparing results between studies. It has also been 
demonstrated that observers with different backgrounds evaluate cases in different ways 
[42]. 

Objective methods of evaluation have emerged as a way to overcome the poor reproducibility 
of subjective assessment and have until now consisted of measurements between identifiable 
points on patient photographs [38,41,43]. The correlation of objective measurements with 
subjective overall evaluation has been reported by several authors [39-41,44]. Until now, 
though, the overall cosmetic outcome was simply the sum of the individual scores of 
subjective and objective individual indices [39,40,44,45]. 



7.3.1 Data and method 

Instead of heuristically weighting the individual indices in an overall measure, we introduced 
pattern classification techniques to find the correct contribution of each individual feature in 
the final result and the scale intervals for each class, constructing this way an optimal rule 
to classify patients. 



7.3.1.1 Reference classification 

Twenty-four clinicians working in twelve different countries were selected, based on their 
experience in BCCT (number of cases seen per year and/or participation in published work 
on evaluation of aesthetic results). They were asked to evaluate individually a series of 
Class       # cases 
Poor        7 
Fair        12 
Good        32 
Excellent   9 

Table 7.3: Distribution of patients over the four classes. 

240 photographs taken from 60 women submitted to BCCT (surgery and radiotherapy). 
Photographs were taken (with a 4M digital camera) in four positions with the patient 
standing on floor marks: facing, arms down; facing, arms up; left side, arms up; right 
side, arms up - figure 7.3. 




Figure 7.3: Positions used in the photographs. 



Participants were asked to evaluate overall aesthetic results, classifying each case into one 
of four categories: excellent - treated breast nearly identical to untreated breast; good - 
treated breast slightly different from untreated; fair - treated breast clearly different from 
untreated but not seriously distorted; poor - treated breast seriously distorted [35]. 

In order to obtain a consensus among observers, the Delphi process was used [46, 47]. 
Evaluation of each case was considered consensual when more than 50% of observers provided 
the same classification. When this did not occur, another round of agreement between 
observers was performed. By means of the Delphi process each and every patient was 
classified in one of the four categories (table 7.3): poor, fair, good, and excellent. 
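
The consensus criterion used in each round can be written down in a few lines. This is a minimal sketch of the "more than 50% of observers" rule only; the function name is hypothetical and the handling of further rounds is left out.

```python
from collections import Counter

def consensus(votes):
    """Return the class chosen by more than 50% of observers, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None
```

A return value of `None` would correspond to a case that has to go through another round of evaluation.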

The panel was also asked to evaluate two individual aesthetic characteristics, scar visibility 
and colour dissimilarities between the breasts, using the same grading scale: excellent; 
good; fair; poor. 

7.3.1.2 Feature Selection 

As possible objective features we considered those already identified by domain experts 
as relevant to the aesthetic evaluation of the surgical procedure [38,43]. The cosmetic 
result after breast conserving treatment is mainly determined by visible skin alterations or 
changes in breast volume or shape. Skin changes may consist of a disturbing surgical scar 
or radiation-induced pigmentation or telangiectasia [43]. Breast asymmetry was assessed 
by Breast Retraction Assessment (BRA), Lower Breast Contour (LBC) or Upward Nipple 
Retraction (UNR) - figure 7.4. Because breast asymmetry was insufficient to discriminate 



Figure 7.4: $LBC = |L_r - L_l|$, $BRA = \sqrt{(X_r - X_l)^2 + (Y_r - Y_l)^2}$, $UNR = |Y_r - Y_l|$. 



among patients, we adopted the mean of the scar visibility and skin colour change, as 
measured by the Delphi panel, as additional features to help in the separation task, as 
we had not yet established the evaluation of those features by quantitative methods [23]. 
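
The three asymmetry measures in the caption of figure 7.4 translate directly into code. The sketch below assumes that the subscripts denote the right and left breast, that L is the level of the lower breast contour and that (X, Y) is the nipple position, all in centimetres; these readings and the function names are assumptions made for illustration.

```python
import math

def lbc(l_right, l_left):
    """Lower Breast Contour: absolute difference of the contour levels."""
    return abs(l_right - l_left)

def bra(x_right, y_right, x_left, y_left):
    """Breast Retraction Assessment: Euclidean distance between the two positions."""
    return math.hypot(x_right - x_left, y_right - y_left)

def unr(y_right, y_left):
    """Upward Nipple Retraction: absolute difference of the vertical positions."""
    return abs(y_right - y_left)
```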

7.3.1.3 Classifier 

The leave-one-out method [8] was selected for the validation of the classifiers: the classifier 
is trained in a round-robin fashion, each time using the available dataset from which a single 
patient has been deleted; each resulting classifier is then tested on the single deleted patient. 
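
The round-robin procedure can be sketched generically as below. The `train` and `predict` callables stand for any of the classifiers under study; the majority-class classifier and the toy data are hypothetical stand-ins used only to make the example self-contained.

```python
from collections import Counter

def leave_one_out_error(samples, labels, train, predict):
    """Estimate the misclassification rate by leave-one-out validation."""
    errors = 0
    for i in range(len(samples)):
        # Train on every example except the i-th one.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        # Test on the single held-out example.
        if predict(model, samples[i]) != labels[i]:
            errors += 1
    return errors / len(samples)

# Hypothetical stand-in classifier: always predicts the majority training class.
def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

toy_x = [[0.1], [0.2], [0.3], [0.9]]
toy_y = ["good", "good", "good", "poor"]
loo = leave_one_out_error(toy_x, toy_y, train_majority, predict_majority)
```

With 60 patients the procedure trains 60 classifiers, each on 59 examples, which is affordable for the simple linear models considered here.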

When in possession of a nearly separable dataset, a simple linear separator is bound to 
misclassify some points. But the real question is whether the non-linearly-separable data indicates 
some intrinsic property of the problem (in which case a more complex classifier, allowing more 
general boundaries between classes, may be more appropriate) or whether it can be interpreted as 
the result of noisy points (measurement errors, uncertainty in class membership, etc.), in 
which case keeping the linear separator and accepting some errors is more natural. Supported 
by Occam's razor principle ("one should not increase, beyond what is necessary, the number 
of entities required to explain anything"), the latter was the option taken in this research. 



A quick visual check of the quality of the data (figure 7.5) shows that there is a data 
value that is logically inconsistent with the others: an individual (patient #31) labeled as 
good when in fact it is placed between fair and poor in the feature space. The classifiers were 




Figure 7.5: Data points in a three-feature space (axis labels include scar visibility and LBC (cm), arms down). 

Figure 7.6: Average of generalization error (MER). (a) Results for SVM methods: C = 10, h = 100, K(x, y) = x^t y. (b) Results for NN methods: h = 100, no hidden units. 



evaluated using datasets with and without this outlier in order to assess the behaviour in the 
presence of noisy examples. In summary, results are reported for six different datasets: {LBC 
(arms down); scar visibility (mean); skin colour change (mean)}, {BRA (arms down); scar 
visibility (mean); skin colour change (mean)}, {UNR (arms down); scar visibility (mean); 
skin colour change (mean)}, each with 59 and 60 examples. In [23] other datasets were 
evaluated, showing similar behaviour. 



Results 



The bar graph in figure 7.6 summarizes the generalization error estimated for each classifier. It is 
apparent that algorithms specially designed for ordinal data perform better than generic 
algorithms for nominal classes. Also noticeable is the superiority of the LBC measure over 
the other asymmetry measures under study in discriminating classes. 



Chapter 8 



Results for datasets from regression 
problems 

Because of the general lack of benchmark datasets for ordinal classification, we also performed 
experiments with datasets from regression problems, by converting the target variable into 
an ordinal quantity. The datasets were taken from a publicly available collection of regression 
problems^. 
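
The tables in this chapter report results "using equal frequency binning" of the target variable. One simple way to perform such a conversion is sketched below; it is a rank-based illustration, not necessarily the exact procedure used in the experiments, and the function name is hypothetical. Note that in this version examples with equal target values may fall into different classes.

```python
def equal_frequency_bins(targets, n_classes):
    """Assign ordinal classes 1..n_classes so that each holds about the same count."""
    # Indices of the examples, sorted by increasing target value.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    classes = [0] * len(targets)
    for position, index in enumerate(order):
        # The position in the sorted order decides the class.
        classes[index] = position * n_classes // len(targets) + 1
    return classes
```

For example, six target values split into three classes yield two examples per class, with class 1 holding the two smallest values.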

8.1 Abalone dataset 

The goal is to predict the age of abalone from physical measurements.* The age of abalone 
is determined by cutting the shell through the cone, staining it, and counting the number 
of rings through a microscope - a boring and time-consuming task. Other measurements, 
which are easier to obtain, are used to predict the age. Further information, such as weather 
patterns and location (hence food availability) may be required to solve the problem. 

Examples with missing values were removed from the original data (the majority missing 
the predicted value), and the ranges of the continuous values have been scaled for use 
with an artificial neural network (by dividing by 200). The sex attribute was represented 
as M = 1, F = 0, I = -1. The characteristics of the dataset are summarized in table 
8.1, which lists the attribute name, attribute type, measurement unit and a brief 
description; the class distribution is depicted in figure 8.1. 

The results obtained (table 8.3) can be compared with results reported in previous studies 
(table 8.2). 

^ The datasets were selected from http://www.liacc.up.pt/~torgo/Regression/DataSets.html 
* The information is a replica of the notes for the abalone dataset from the UCI repository. 
Name             Data Type    Meas.   Description 
Sex              nominal              M, F, and I (infant) 
Length           continuous   mm      longest shell measurement 
Diameter         continuous   mm      perpendicular to length 
Height           continuous   mm      with meat in shell 
Whole weight     continuous   grams   whole abalone 
Shucked weight   continuous   grams   weight of meat 
Viscera weight   continuous   grams   gut weight (after bleeding) 
Shell weight     continuous   grams   after being dried 
Rings            integer              + 1.5 gives the age in years 

Table 8.1: Characteristics of the abalone dataset. 











Figure 8.1: Class distribution over 4177 examples, for the abalone dataset (horizontal axis: rings). 





             C4.5-ORD   C4.5   C4.5-1PC 
3 classes    34.9       36.1   34.1 
5 classes    51.9       53.7   50.5 
10 classes   70.6       73.3   72.6 

(a) Results reported in [22]. 

             CRT    MDT 
3 classes    47.7   54.7 
5 classes    62.1   57.8 
10 classes   78.4   70.7 

(b) Results reported in [12]. 

Table 8.2: MER (%) for the Abalone dataset with decision trees, using equal frequency 
binning. 





             cSVM   pSVM   hSVM   oSVM 
3 classes    37.4   36.8   NA     37.0 
5 classes    53.3   54.2   NA     54.4 
10 classes   73.5   74.1   NA     73.7 

(a) SVMs' results. C = 1000, h = 1, s, K(x, y) = x^t y, training set size = 200. 

             cNN    pNN    iNN    uNN    oNN 
3 classes    37.4   37.3   37.4   37.9   37.2 
5 classes    53.0   53.9   54.3   60.5   55.5 
10 classes   73.3   74.0   74.7   80.8   75.9 

(b) NNs' results, h = 1, no hidden units, training set size = 200. 

Table 8.3: MER (%) for the Abalone dataset, using equal frequency binning. 
If it is true that we generally prefer simpler models for explanation at the same 
performance (a parsimonious representation of the observed data), then the simple weighted 
sum of the attributes yielded by the data replication method is clearly at an advantage. 



8.2 CPU performance dataset 



The goal is to predict the relative CPU performance. From the 10 initial attributes, 6 were 
used as predictive attributes and 1 as the goal field, discarding the vendor name, model 
name and estimated relative performance from the original article. The characteristics of 
the fields used are summarized in table 8.4 for the 209 instances. 



Name    Data Type   Description                         Min   Max 
MYCT    integer     machine cycle time in nanoseconds   17    1500 
MMIN    integer     minimum main memory in kilobytes    64    32000 
MMAX    integer     maximum main memory in kilobytes    64    64000 
CACH    integer     cache memory in kilobytes           0     256 
CHMIN   integer     minimum channels in units           0     52 
CHMAX   integer     maximum channels in units           0     176 
PRP     integer     published relative performance      6     1150 

Table 8.4: Characteristics of the CPU performance dataset. 



Before training, the predictive attributes were scaled to fall always within the range [0,1], 
using the transformation $x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$. The results obtained 
(table 8.6) can be compared with results reported in previous studies (table 8.5). 





             C4.5-ORD   C4.5   C4.5-1PC 
3 classes    26.1       28.2   25.7 
5 classes    41.9       43.2   43.4 
10 classes   63.5       63.8   69.4 

(a) Results reported in [22]. 

             CRT    MDT 
3 classes    45.9   31.1 
5 classes    45.0   40.7 
10 classes   57.9   57.4 

(b) Results reported in [12]. 

Table 8.5: MER (%) for the Machine CPU dataset, using decision trees. 



These results continue to suggest the merit of specific methods for ordinal data over 
conventional methods, attaining the best performance at the greatest simplicity. 
             cSVM   pSVM   hSVM   oSVM 
3 classes    23.9   22.9   NA     23.4 
5 classes    39.7   43.0   NA     44.6 
10 classes   69.0   65.4   NA     67.8 

(a) SVMs' results. C = 1000, h = 1, s = 1, K(x, y) = x^t y, training set size = 190. 

             cNN    pNN    iNN    uNN    oNN 
3 classes    23.8   22.3   25.5   24.3   23.4 
5 classes    42.0   42.4   42.7   42.5   42.8 
10 classes   65.7   66.6   68.1   68.3   65.3 

(b) NNs' results, h = 1, without hidden layers. 

Table 8.6: MER (%) for the Machine CPU dataset. 



Chapter 9 

Conclusion 



This study focuses on the application of machine learning methods, and in particular of 
neural networks and support vector machines, to the problem of classifying ordinal data. 
Two novel approaches to train learning algorithms for ordinal data were presented. The 
first idea is to reduce the problem to the standard two-class setting, using the so-called data 
replication method, a nonparametric procedure for the classification of ordinal categorical 
data. This method was mapped into neural networks and support vector machines. Two 
well-known approaches for the classification of ordinal categorical data were unified under 
this framework: the minimum margin principle [4] and the generic approach by Frank and 
Hall [22]. Finally, a probabilistic interpretation for the neural network model was also 
presented. 

The second idea is to retain the ordinality of the classes by imposing a parametric model for 
the output probabilities. The introduced unimodal model, mapped to neural networks, was 
then confronted with established regression methods. 

The study compares the results of the proposed models with conventional learning 
algorithms for nominal classes and with models proposed in the literature specifically for ordinal 
data. Simple misclassification, mean absolute error, root mean square error, Spearman and 
Kendall's tau-b coefficients are used as measures of performance for all models and used 
for model comparison. The new methods are likely to produce simpler and more robust 
classifiers, and compare favourably with state-of-the-art methods. However, the reported 
results must be taken with caution. In most of the experiments the effort to find the correct 
setting for the parameters of the algorithms was limited (although unbiased among methods). 
So, although reasonable conclusions are drawn from the experiments, we do not wish to 
overstate our claims. 

This thesis has covered multiclass classification accuracy and classifier simplicity, but a 
brief word on speed is in order. Comparing different machine learning algorithms for speed 
is notoriously difficult; we are simultaneously judging mathematical algorithms and specific 
implementations. However, some useful general observations can be made. Empirically, 
SVM training time tends to be superlinear in the number of training points [48]. Armed 
only with this assumption, it is a simple exercise to conclude that the complexity of the data 
replication formulation is placed between the simple approach of Frank and Hall and the 
pairwise approach of Herbrich. 
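
The "simple exercise" can be made concrete with a rough cost model. The sketch below assumes that training a binary SVM on m points costs m^alpha with alpha > 1, that the Frank and Hall approach trains K-1 binary problems of n points each, that the data replication method trains a single binary problem with up to (K-1)n replicated points, and that the pairwise approach works on up to n(n-1)/2 pairs. These counts are simplifying assumptions for illustration, and the function names are hypothetical.

```python
def training_cost(n_points, alpha):
    """Assumed superlinear training cost of one binary SVM."""
    return n_points ** alpha

def compare_costs(n, k, alpha=2.0):
    """Return rough costs (Frank and Hall, data replication, pairwise)."""
    frank_hall = (k - 1) * training_cost(n, alpha)      # K-1 problems of n points
    replication = training_cost((k - 1) * n, alpha)     # one problem of (K-1)*n points
    pairwise = training_cost(n * (n - 1) // 2, alpha)   # one problem over all pairs
    return frank_hall, replication, pairwise
```

Under these assumptions, and as long as K-1 does not exceed (n-1)/2, the three costs come out in increasing order, in line with the ordering stated in the text.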

An issue intentionally avoided until now is the very definition of ordinal classes. 
Although we do not wish to delve deeply into that now, a few comments are in order. 

At first sight, a model that restricts the search to noncrossing boundaries is too restrictive, 
imposing unnecessary and unnatural constraints on the solution and so limiting the feasible 
solutions to a subset of what we would expect to be valid solutions to an ordinal data 
problem. On the other hand, the unimodal model, more plausible and intuitive, seems to 
capture the essence of the problem better. However, it is a simple exercise to verify that 
the unimodal model does not allow boundaries to intersect: the intersection point would 
indicate an example where three or more classes are equally probable. For that reason, the 
unimodal model (parametric or not) seems to be a subset of the noncrossing boundaries 
model. It is also reasonable to accept that each noncrossing boundary solution may be 
explained by at least one unimodal model (however, there is no bijection between the 
two, as different unimodal models may lead to the same noncrossing boundary solution; in 
fact, non-unimodal models may also lead to noncrossing boundaries). 

A resemblance is visible here with the contrast between parametric classifiers, which must 
estimate the probability density function for each class in order to apply the Bayes likelihood 
ratio test, and classifiers that specify the mathematical form of the classifier (linear, quadratic, 
etc.), leaving a finite set of parameters to be determined. 

We are not advocating any model in particular. Pragmatically, see them as two more tools: 
only testing will say which is best in a specific machine learning problem. 

Finally, like all unfinished jobs, this one also leaves some interesting anchors for future work. 
As mentioned in this thesis, several unimodal models were implemented making use of a 
generic optimization function available in Matlab. It would be most interesting to adapt 
the backpropagation method to all unimodal models and perform a fair comparison. The 
data replication method is parameterised by h (and C); because it may be difficult and time 
consuming to choose the best value for h, it would be interesting to study possible ways 
to set this parameter automatically, probably as a function of the data and C. It would 
also be interesting to study whether these algorithms can be successfully applied to nominal data. 
Although the data replication method was designed for ordinal classes, nothing impedes its 
application to nominal classes. It is expected that the classifier should be evaluated for each 
possible order of the classes, choosing the one leading to the best performance (feasible 
only when the number of classes is small). A systematic study of decision tree algorithms 
for ordinal data is also indispensable. It would be a significant accomplishment to map the 
models introduced in this thesis to decision trees. 
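
The exhaustive search over class orders suggested for nominal data could be organised as in the sketch below. It assumes that reversing an order leads to an equivalent problem, so only one order of each mirrored pair is kept; `score` is a hypothetical callable returning, for instance, a cross-validated error for a given order.

```python
from itertools import permutations

def candidate_orderings(classes):
    """Yield each ordering of the classes once, skipping reversed duplicates."""
    for perm in permutations(classes):
        # Keep only the lexicographically smaller of an order and its mirror.
        if perm <= perm[::-1]:
            yield perm

def best_ordering(classes, score):
    """Return the ordering with the lowest score (e.g. estimated error)."""
    return min(candidate_orderings(classes), key=score)
```

For K classes this evaluates K!/2 orderings (3 for K = 3, 60 for K = 5, 1814400 for K = 10), which illustrates why the search is only feasible when the number of classes is small.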